How to resample a multidimensional event without losing information
Question
I have a dataset representing a time series.
The time series has 165 events.
Each unique event has 15 rows: FX1, FX2, …, FX15.
The time between consecutive rows is 1 minute, so one event spans 15 minutes.
Each row has 6 features: F1, F2, F3, T1, T2, T3.
All values are numeric integers.
Here is an example of an event:
This event started at 2000/01/01 00:00:00 and ended at 2000/01/01 00:15:00.
The first row in the event has F1=3, F2=-4, …, T3=45.
time F1 F2 F3 T1 T2 T3
0 2000/01/01 00:01:00 94 -76 0 47 9 -20
1 2000/01/01 00:02:00 -2 85 14 79 92 -95
2 2000/01/01 00:03:00 -3 13 -100 33 74 -43
3 2000/01/01 00:04:00 39 64 -29 32 -73 -44
4 2000/01/01 00:05:00 80 44 3 73 56 -100
5 2000/01/01 00:06:00 -19 -51 -77 32 72 24
6 2000/01/01 00:07:00 79 -69 -87 4 20 19
7 2000/01/01 00:08:00 68 6 95 -76 34 58
8 2000/01/01 00:09:00 26 -59 24 79 -43 48
9 2000/01/01 00:10:00 71 8 -85 -15 -45 -56
10 2000/01/01 00:11:00 51 98 6 -53 -39 5
11 2000/01/01 00:12:00 99 73 -48 -1 64 56
12 2000/01/01 00:13:00 -12 13 -63 51 36 95
13 2000/01/01 00:14:00 8 -100 54 91 -56 -32
My question is this:
Can I convert one event to just one row without losing information?
Let's say I use the pandas DataFrame resample() method for downsampling.
Think of it like resampling a dataset: in this case, resampling to 15 minutes will create a single row, I think, but will the resulting one-dimensional row lack any information?
So the desired output for the above data frame would be something like this:
time F1 F2 F3 T1 T2 T3
2000/01/01 00:01:00 XF1 XF2 XF3 xT1 xT2 xT3
Where XF1 is the value that best represents the array of 14 values that F1 took in rows 0 to 13 (94, -2, -3, 39, 80, -19, 79, …, 8).
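For reference, here is a minimal sketch of the kind of call I have in mind (the toy event below uses made-up values, and the mean is only a placeholder for whatever aggregation would best represent each column):

import numpy as np
import pandas as pd

# Sketch only: build one toy event of 14 one-minute rows with random feature values
idx = pd.date_range("2000-01-01 00:01:00", periods=14, freq="1min")
event = pd.DataFrame(
    np.random.randint(-100, 100, size=(14, 6)),
    index=idx,
    columns=["F1", "F2", "F3", "T1", "T2", "T3"],
)

# Resampling to 15-minute bins collapses the event to a single row,
# but every column has to be reduced by some aggregation (mean is only a placeholder)
one_row = event.resample("15min").mean()
print(one_row)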
Answer 1
Score: 1
With the following toy dataframe:
import random
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"time": [f"2000/01/01 00:{i:02}:00" for i in range(1, 60)]}
    | {
        key: [random.randrange(-100, 100) for _ in range(1, 60)]
        for key in ("F1", "F2", "F3", "T1", "T2", "T3")
    }
)
print(df.head(30))
# Output
time F1 F2 F3 T1 T2 T3
0 2000/01/01 00:01:00 44 -91 5 -23 97 14
1 2000/01/01 00:02:00 -38 46 -14 -5 -66 39
2 2000/01/01 00:03:00 70 -63 -28 -53 53 77
3 2000/01/01 00:04:00 33 -16 82 98 54 95
4 2000/01/01 00:05:00 -51 -89 -52 -88 -68 -61
5 2000/01/01 00:06:00 -64 69 25 -98 21 63
6 2000/01/01 00:07:00 -52 51 -34 35 -47 83
7 2000/01/01 00:08:00 -10 10 -87 -49 75 7
8 2000/01/01 00:09:00 -51 -95 25 -49 -43 -13
9 2000/01/01 00:10:00 -16 88 -23 -3 -17 71
10 2000/01/01 00:11:00 4 -97 3 53 -35 -83
11 2000/01/01 00:12:00 -94 -17 -88 -5 41 60
12 2000/01/01 00:13:00 91 -14 43 79 -8 14
13 2000/01/01 00:14:00 94 -1 -57 7 -21 91
14 2000/01/01 00:15:00 -60 -2 39 -56 -61 24
15 2000/01/01 00:16:00 -20 -83 30 68 -97 -87
16 2000/01/01 00:17:00 7 70 -65 49 13 -66
17 2000/01/01 00:18:00 29 -70 78 84 -80 -5
18 2000/01/01 00:19:00 57 -57 -78 -75 29 -12
19 2000/01/01 00:20:00 -1 -48 -91 89 25 88
20 2000/01/01 00:21:00 -60 -90 6 34 -77 34
21 2000/01/01 00:22:00 -28 7 -33 -64 42 56
22 2000/01/01 00:23:00 -29 85 45 29 -20 -38
23 2000/01/01 00:24:00 40 -26 17 18 50 -100
24 2000/01/01 00:25:00 -74 60 -50 -3 81 -91
25 2000/01/01 00:26:00 35 47 -90 19 48 -47
26 2000/01/01 00:27:00 -32 34 -43 33 26 26
27 2000/01/01 00:28:00 74 12 -11 -97 -20 -29
28 2000/01/01 00:29:00 58 -90 -7 -88 29 -89
29 2000/01/01 00:30:00 39 -51 -88 -94 -26 -27
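(Since the toy values come from random.randrange, the exact numbers will differ from run to run; seeding the random module before building the frame, as sketched below, would make them reproducible.)

import random

random.seed(0)  # optional: fix the seed so the toy values are the same on every run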
Here is one way to do it:
df["time"] = pd.to_datetime(df["time"], format="%Y/%m/%d %H:%M:%S")
new_df = (
df.set_index("time").resample("15T").agg(lambda x: int(np.mean(x)))
) # 15 min. resampling
Then:
print(new_df)
# Output
F1 F2 F3 T1 T2 T3
time
2000-01-01 00:00:00 -2 -15 -14 -7 2 32
2000-01-01 00:15:00 0 -10 -16 2 0 -22
2000-01-01 00:30:00 7 -3 -13 -12 0 24
2000-01-01 00:45:00 -1 0 -5 -23 -12 -23
Without more context, it's impossible to determine what would be the best representation of the resampled values, so I chose the mean value, but you can replace the lambda function with anything more suitable.
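For example, continuing with the same df, switching to the median or to several summary statistics per feature only requires changing the aggregation (the choice of statistics below is purely illustrative):

# Same 15 min. resampling, but using the median as the representative value
new_df_median = df.set_index("time").resample("15T").median()

# Or keep several summary statistics per feature (this yields one column level per statistic)
new_df_stats = df.set_index("time").resample("15T").agg(["mean", "min", "max"])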
Also, I don't think you will be able to avoid losing some information, as resampling/aggregating necessarily comes at a cost.