How to resample a multidimensional event without losing information

Question

I have a dataset representing a time series.
The time series has 165 events.

Each unique event has 15 rows: FX1, FX2, … FX15.

The time between consecutive rows is 1 minute, so one event takes 15 minutes.

Each row has 6 features: F1, F2, F3, T1, T2, T3.
All values are integers.

Here is an example of an event:

This event started at 2000/01/01 00:00:00 and ended at 2000/01/01 00:15:00.

The first row in the event has F1=3, F2=-4, … T3=45.


                   time  F1   F2   F3  T1  T2   T3
0   2000/01/01 00:01:00  94  -76    0  47   9  -20
1   2000/01/01 00:02:00  -2   85   14  79  92  -95
2   2000/01/01 00:03:00  -3   13 -100  33  74  -43
3   2000/01/01 00:04:00  39   64  -29  32 -73  -44
4   2000/01/01 00:05:00  80   44    3  73  56 -100
5   2000/01/01 00:06:00 -19  -51  -77  32  72   24
6   2000/01/01 00:07:00  79  -69  -87   4  20   19
7   2000/01/01 00:08:00  68    6   95 -76  34   58
8   2000/01/01 00:09:00  26  -59   24  79 -43   48
9   2000/01/01 00:10:00  71    8  -85 -15 -45  -56
10  2000/01/01 00:11:00  51   98    6 -53 -39    5
11  2000/01/01 00:12:00  99   73  -48  -1  64   56
12  2000/01/01 00:13:00 -12   13  -63  51  36   95
13  2000/01/01 00:14:00   8 -100   54  91 -56  -32

My question is this:

Can I convert one event to just one row without losing information?

Let's say I use the pandas DataFrame resample() method for downsampling.

Think of it like resampling a dataset: in this case, resampling to 15 minutes will, I think, create a single row per event, but will the resulting one-dimensional data lack any information?

So the desired output for the above data frame would be something like this:

                 time  F1   F2   F3   T1   T2     T3
 2000/01/01 00:01:00  XF1  XF2  XF3  xT1  xT2    xT3

Where XF1 is the value that best represents the array of 14 values that F1 took in rows 0 to 13 (94, -2, -3, 39, 80, -19, 79, …, 8).

Answer 1

Score: 1

With the following toy dataframe:

import random
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"time": [f"2000/01/01 00:{i:02}:00" for i in range(1, 60)]}
    | {
        key: [random.randrange(-100, 100) for _ in range(1, 60)]
        for key in ("F1", "F2", "F3", "T1", "T2", "T3")
    }
)

print(df.head(30))

# Output (the values are random, so a rerun will print different numbers)
                   time  F1  F2  F3  T1  T2   T3
0   2000/01/01 00:01:00  44 -91   5 -23  97   14
1   2000/01/01 00:02:00 -38  46 -14  -5 -66   39
2   2000/01/01 00:03:00  70 -63 -28 -53  53   77
3   2000/01/01 00:04:00  33 -16  82  98  54   95
4   2000/01/01 00:05:00 -51 -89 -52 -88 -68  -61
5   2000/01/01 00:06:00 -64  69  25 -98  21   63
6   2000/01/01 00:07:00 -52  51 -34  35 -47   83
7   2000/01/01 00:08:00 -10  10 -87 -49  75    7
8   2000/01/01 00:09:00 -51 -95  25 -49 -43  -13
9   2000/01/01 00:10:00 -16  88 -23  -3 -17   71
10  2000/01/01 00:11:00   4 -97   3  53 -35  -83
11  2000/01/01 00:12:00 -94 -17 -88  -5  41   60
12  2000/01/01 00:13:00  91 -14  43  79  -8   14
13  2000/01/01 00:14:00  94  -1 -57   7 -21   91
14  2000/01/01 00:15:00 -60  -2  39 -56 -61   24
15  2000/01/01 00:16:00 -20 -83  30  68 -97  -87
16  2000/01/01 00:17:00   7  70 -65  49  13  -66
17  2000/01/01 00:18:00  29 -70  78  84 -80   -5
18  2000/01/01 00:19:00  57 -57 -78 -75  29  -12
19  2000/01/01 00:20:00  -1 -48 -91  89  25   88
20  2000/01/01 00:21:00 -60 -90   6  34 -77   34
21  2000/01/01 00:22:00 -28   7 -33 -64  42   56
22  2000/01/01 00:23:00 -29  85  45  29 -20  -38
23  2000/01/01 00:24:00  40 -26  17  18  50 -100
24  2000/01/01 00:25:00 -74  60 -50  -3  81  -91
25  2000/01/01 00:26:00  35  47 -90  19  48  -47
26  2000/01/01 00:27:00 -32  34 -43  33  26   26
27  2000/01/01 00:28:00  74  12 -11 -97 -20  -29
28  2000/01/01 00:29:00  58 -90  -7 -88  29  -89
29  2000/01/01 00:30:00  39 -51 -88 -94 -26  -27

Here is one way to do it:

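# Parse the strings into timestamps so the column can serve as a DatetimeIndex.
df["time"] = pd.to_datetime(df["time"], format="%Y/%m/%d %H:%M:%S")
# Aggregate each 15-minute bin to its mean, truncated to an integer.
# "15min" is used instead of the "15T" alias, which newer pandas versions deprecate.
new_df = (
    df.set_index("time").resample("15min").agg(lambda x: int(np.mean(x)))
)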

Then:

print(new_df)
# Output (depends on the random values generated above)

                     F1  F2  F3  T1  T2  T3
time
2000-01-01 00:00:00  -2 -15 -14  -7   2  32
2000-01-01 00:15:00   0 -10 -16   2   0 -22
2000-01-01 00:30:00   7  -3 -13 -12   0  24
2000-01-01 00:45:00  -1   0  -5 -23 -12 -23
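
A side note on how the bins above are formed: by default, resample() anchors 15-minute bins at the start of the day, so they begin at 00:00, 00:15, and so on, each including its left edge. Since the toy timestamps start at 00:01, the first bin holds only 14 rows. The sketch below counts the rows per bin and then, as one possible adjustment, shifts the bin edges by one minute with the `offset` parameter. The names `toy`, `bin_sizes` and `shifted_sizes` are made up for this illustration.

```python
import pandas as pd

# Same timestamps as the toy dataframe: one row per minute, 00:01 to 00:59.
times = pd.date_range("2000-01-01 00:01:00", periods=59, freq="min")
toy = pd.DataFrame({"F1": range(59)}, index=times)

# size() counts how many rows land in each 15-minute bin.
bin_sizes = toy.resample("15min").size()
print(bin_sizes)

# Shifting the bin edges by one minute makes the first bin start at 00:01.
shifted_sizes = toy.resample("15min", offset="1min").size()
print(shifted_sizes)
```

If the real events do not line up with clock-based bins at all, grouping by an explicit event identifier may be a safer choice than resample().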

Without more context, it's impossible to determine what would be the best representation of the resampled values, so I chose the mean value, but you can replace the lambda function with anything more suitable.
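
As a rough illustration of swapping in other aggregations, the sketch below keeps several summary statistics per feature instead of a single mean. It is only one possible choice, not a recommendation for this particular dataset. The seeded random data and the names `summary` and `rng` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable
times = pd.date_range("2000-01-01 00:00:00", periods=60, freq="min")
df = pd.DataFrame(
    rng.integers(-100, 100, size=(60, 6)),
    index=times,
    columns=["F1", "F2", "F3", "T1", "T2", "T3"],
)

# Passing a list of statistics yields one column per (feature, statistic) pair.
summary = df.resample("15min").agg(["mean", "median", "std", "min", "max"])

# Flatten the two-level column index into names such as "F1_mean".
summary.columns = [f"{col}_{stat}" for col, stat in summary.columns]
print(summary.shape)
```

Each 15-minute event is still reduced to one row, but the row now describes the spread of each feature as well as its center. This loses less than a single mean, though it is still lossy.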

Also, I don't think you will be able to avoid losing some information, as resampling/aggregating necessarily comes at a cost.
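
That said, if the goal is one row per event with nothing discarded, an alternative to aggregating is reshaping: spread the 15 rows × 6 features of an event across 90 columns. This is not resampling, and it is not what the code above does. It is a minimal sketch assuming every event has exactly 15 consecutive rows; the `event` and `step` columns and the names `long_df`, `wide` and `restored` are hypothetical.

```python
import numpy as np
import pandas as pd

features = ["F1", "F2", "F3", "T1", "T2", "T3"]
rows_per_event = 15
n_events = 4

rng = np.random.default_rng(0)  # seeded stand-in data, not the real dataset
long_df = pd.DataFrame(
    rng.integers(-100, 100, size=(n_events * rows_per_event, len(features))),
    columns=features,
)

# Hypothetical labels: which event a row belongs to and its position inside it.
long_df["event"] = np.arange(len(long_df)) // rows_per_event
long_df["step"] = np.arange(len(long_df)) % rows_per_event + 1

# One row per event, one column per (feature, step) pair.
wide = long_df.pivot(index="event", columns="step", values=features)
wide.columns = [f"{feat}_{step}" for feat, step in wide.columns]
print(wide.shape)  # (4, 90)

# Round trip: rebuild the original long layout to confirm nothing was lost.
restored = (
    wide.to_numpy()
    .reshape(n_events, len(features), rows_per_event)
    .transpose(0, 2, 1)
    .reshape(-1, len(features))
)
print((restored == long_df[features].to_numpy()).all())  # True
```

The trade-off is width: each event becomes a 90-value row instead of a 6-value one, so whether this is preferable to an aggregate depends on what the rows are used for afterwards.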

huangapple
  • Posted on May 6, 2023 at 22:25:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76189409.html