2023-05-06 22:25:06

How to resample a multidimensional event without losing information

Question

I have a dataset representing a time series. The time series has 165 events.

Each unique event has 15 rows, FX1, FX2, … FX15.

The time between rows is 1 minute, so one event takes 15 minutes.

Each row has 6 features: F1, F2, F3, T1, T2, T3. All values are integers.

Here is an example of an event:

This event started at 2000/01/01 00:00:00 and ended at 2000/01/01 00:15:00.

The first row in the event has F1=3, F2=-4, … T3=45.


                   time  F1   F2   F3  T1  T2   T3
0   2000/01/01 00:01:00  94  -76    0  47   9  -20
1   2000/01/01 00:02:00  -2   85   14  79  92  -95
2   2000/01/01 00:03:00  -3   13 -100  33  74  -43
3   2000/01/01 00:04:00  39   64  -29  32 -73  -44
4   2000/01/01 00:05:00  80   44    3  73  56 -100
5   2000/01/01 00:06:00 -19  -51  -77  32  72   24
6   2000/01/01 00:07:00  79  -69  -87   4  20   19
7   2000/01/01 00:08:00  68    6   95 -76  34   58
8   2000/01/01 00:09:00  26  -59   24  79 -43   48
9   2000/01/01 00:10:00  71    8  -85 -15 -45  -56
10  2000/01/01 00:11:00  51   98    6 -53 -39    5
11  2000/01/01 00:12:00  99   73  -48  -1  64   56
12  2000/01/01 00:13:00 -12   13  -63  51  36   95
13  2000/01/01 00:14:00   8 -100   54  91 -56  -32

My question is this:

Can I convert one event to just one row without losing information?

Let's say I use the pandas DataFrame resample() method for downsampling.

In this case, resampling to 15 minutes should, I think, produce a single row per event, but will the resulting one-row data lack any information?

So the desired output for the above data frame would be something like this:

                time   F1   F2   F3   T1   T2   T3
 2000/01/01 00:01:00  XF1  XF2  XF3  XT1  XT2  XT3

Where XF1 is the value that best represents the 14 values that F1 took in rows 0 to 13 (94, -2, -3, 39, 80, -19, 79, ..., 8).

Answer 1

Score: 1

With the following toy dataframe:

import random
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"time": [f"2000/01/01 00:{i:02}:00" for i in range(1, 60)]}
    | {
        key: [random.randrange(-100, 100) for _ in range(1, 60)]
        for key in ("F1", "F2", "F3", "T1", "T2", "T3")
    }
)

print(df.head(30))

# Output
                   time  F1  F2  F3  T1  T2   T3
0   2000/01/01 00:01:00  44 -91   5 -23  97   14
1   2000/01/01 00:02:00 -38  46 -14  -5 -66   39
2   2000/01/01 00:03:00  70 -63 -28 -53  53   77
3   2000/01/01 00:04:00  33 -16  82  98  54   95
4   2000/01/01 00:05:00 -51 -89 -52 -88 -68  -61
5   2000/01/01 00:06:00 -64  69  25 -98  21   63
6   2000/01/01 00:07:00 -52  51 -34  35 -47   83
7   2000/01/01 00:08:00 -10  10 -87 -49  75    7
8   2000/01/01 00:09:00 -51 -95  25 -49 -43  -13
9   2000/01/01 00:10:00 -16  88 -23  -3 -17   71
10  2000/01/01 00:11:00   4 -97   3  53 -35  -83
11  2000/01/01 00:12:00 -94 -17 -88  -5  41   60
12  2000/01/01 00:13:00  91 -14  43  79  -8   14
13  2000/01/01 00:14:00  94  -1 -57   7 -21   91
14  2000/01/01 00:15:00 -60  -2  39 -56 -61   24
15  2000/01/01 00:16:00 -20 -83  30  68 -97  -87
16  2000/01/01 00:17:00   7  70 -65  49  13  -66
17  2000/01/01 00:18:00  29 -70  78  84 -80   -5
18  2000/01/01 00:19:00  57 -57 -78 -75  29  -12
19  2000/01/01 00:20:00  -1 -48 -91  89  25   88
20  2000/01/01 00:21:00 -60 -90   6  34 -77   34
21  2000/01/01 00:22:00 -28   7 -33 -64  42   56
22  2000/01/01 00:23:00 -29  85  45  29 -20  -38
23  2000/01/01 00:24:00  40 -26  17  18  50 -100
24  2000/01/01 00:25:00 -74  60 -50  -3  81  -91
25  2000/01/01 00:26:00  35  47 -90  19  48  -47
26  2000/01/01 00:27:00 -32  34 -43  33  26   26
27  2000/01/01 00:28:00  74  12 -11 -97 -20  -29
28  2000/01/01 00:29:00  58 -90  -7 -88  29  -89
29  2000/01/01 00:30:00  39 -51 -88 -94 -26  -27

Here is one way to do it:

df["time"] = pd.to_datetime(df["time"], format="%Y/%m/%d %H:%M:%S")
new_df = (
    df.set_index("time").resample("15min").agg(lambda x: int(np.mean(x)))
)  # 15 min. resampling; int() truncates each mean toward zero

Then:

print(new_df)
# Output

                     F1  F2  F3  T1  T2  T3
time
2000-01-01 00:00:00  -2 -15 -14  -7   2  32
2000-01-01 00:15:00   0 -10 -16   2   0 -22
2000-01-01 00:30:00   7  -3 -13 -12   0  24
2000-01-01 00:45:00  -1   0  -5 -23 -12 -23
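
One caveat about the bins above: the timestamps start at 00:01, while the default 15-minute bins are left-closed and anchored at 00:00, so the first bin only holds the 14 rows from 00:01 to 00:14. If each event really is 15 consecutive rows, as the question describes, a sketch that groups by row position instead of clock time might look like this. The `event_id` column and the synthetic data are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

ROWS_PER_EVENT = 15  # assumption: every event is exactly 15 consecutive rows
FEATURES = ["F1", "F2", "F3", "T1", "T2", "T3"]

# Synthetic stand-in data: 3 events of 15 one-minute rows each.
rng = np.random.default_rng(0)
n_rows = 3 * ROWS_PER_EVENT
df = pd.DataFrame(
    rng.integers(-100, 100, size=(n_rows, len(FEATURES))), columns=FEATURES
)
df.insert(0, "time", pd.date_range("2000-01-01 00:01:00", periods=n_rows, freq="min"))

# Label each row with the event it belongs to, based on its position.
df["event_id"] = np.arange(len(df)) // ROWS_PER_EVENT

# One row per event: the first timestamp plus the mean of each feature.
per_event = df.groupby("event_id").agg(
    time=("time", "first"), **{col: (col, "mean") for col in FEATURES}
)
print(per_event)
```

Grouping by position only makes sense if no rows are missing; with gaps in the data, a real event identifier would be safer.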

Without more context, it's impossible to determine what would be the best representation of the resampled values, so I chose the mean value, but you can replace the lambda function with anything more suitable.
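
As an illustration of swapping the aggregation, the sketch below keeps several summary statistics per column rather than a single mean. The choice of statistics (mean, min, max, std) and the synthetic data are assumptions, not something the question specifies:

```python
import numpy as np
import pandas as pd

features = ["F1", "F2", "F3", "T1", "T2", "T3"]

# Synthetic stand-in data: 60 one-minute rows of random integers.
rng = np.random.default_rng(0)
index = pd.date_range("2000-01-01 00:00:00", periods=60, freq="min", name="time")
df = pd.DataFrame(
    rng.integers(-100, 100, size=(60, len(features))), index=index, columns=features
)

# Several statistics per feature and per 15-minute bin.
summary = df.resample("15min").agg(["mean", "min", "max", "std"])

# Flatten the two-level column index into names such as "F1_mean".
summary.columns = [f"{col}_{stat}" for col, stat in summary.columns]
print(summary.shape)
```

This still discards information (the order of values within a bin, for example), but it preserves more of each feature's distribution than the mean alone.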

Also, I don't think you will be able to avoid losing some information, as resampling/aggregating necessarily comes at a cost.
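
If losing nothing is a hard requirement, one alternative to aggregating is to reshape instead of summarize: spread the 15 rows of an event across columns, so each event becomes one row holding all 15 × 6 = 90 values. A minimal sketch, assuming fixed-length events and using hypothetical `event_id` and `step` helper columns on synthetic data:

```python
import numpy as np
import pandas as pd

ROWS_PER_EVENT = 15  # assumption: fixed-length events
FEATURES = ["F1", "F2", "F3", "T1", "T2", "T3"]

# Synthetic stand-in data: two events of 15 rows each.
rng = np.random.default_rng(0)
n_rows = 2 * ROWS_PER_EVENT
df = pd.DataFrame(
    rng.integers(-100, 100, size=(n_rows, len(FEATURES))), columns=FEATURES
)

# Hypothetical helper columns: the event a row belongs to and its position inside it.
df["event_id"] = np.arange(n_rows) // ROWS_PER_EVENT
df["step"] = np.arange(n_rows) % ROWS_PER_EVENT + 1

# One row per event, one column per (feature, step) pair.
wide = df.pivot(index="event_id", columns="step", values=FEATURES)
wide.columns = [f"{feat}_{step}" for feat, step in wide.columns]
print(wide.shape)
```

Every original value can be read back from the wide row, which is the sense in which nothing is lost. The price is 90 columns per event instead of 6.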
