如何计算30分钟窗口的最后值与第一个值之间的差异,并保留其他列?

huangapple go评论71阅读模式
英文:

How to take difference between last value and first value of 30 min window, and keep other columns as well?

问题

Parquet:
示例数据集链接
CSV 格式:
示例数据集链接

更新后的数据集: 新数据集链接

预期结果应该包括在30分钟采样窗口内最后一个值和第一个值之间的差异结果如上所示的示例数据),即:

项目          值     代理状态    原始时间戳                                    代理时间戳
Channel1.Device1.Tag1    847    正常      2023-07-28T13:09:00.0098328+09:00    2023-07-28T13:09:00.0000000
Channel1.Device1.Tag2    0      正常      2023-07-28T13:09:00.0408696+09:00    2023-07-28T13:09:00.0000000
Channel1.Device1.Tag1    848    正常      2023-07-28T13:09:05.0138770+09:00    2023-07-28T13:09:05.0000000
...

预期结果/计算:

在30分钟内采样后,结果将包含来自这两个项目的数据,如下所示:

如何计算30分钟窗口的最后值与第一个值之间的差异,并保留其他列?

  • 对于每个项目,例如:对于 Channel1.Device1.Tag2:
    最后一个值 - 30分钟窗口的第一个值,即:

最后一行的值(878):

39  Channel1.Device1.Tag2   878 坏    2023-07-28T13:30:15.0338704+09:08    2023-07-28T13:30:15.0000000 

第一行的值(0):

1   Channel1.Device1.Tag2   0   正常   2023-07-28T13:09:00.0408696+09:00    2023-07-28T13:09:00.0000000

预期结果/数据框:

如何计算30分钟窗口的最后值与第一个值之间的差异,并保留其他列?

如图所示,值878和3对应于Channel1.Device1.Tag2和Channel1.Device1.Tag1,取(最后一个值 - 第一个值)之间的差值:(878 - 0) 和 (850-847),而其他列的值保持不变。

更新:

新数据集经过了来自mozway的提供的解决方案,结果有轻微偏差(而不是取00:30:00 - 00:00:00之间的差值),因为每个项目标签名称都有重复的时间戳值。如何解决这个问题?或者如何跳过该行中的重复值。

我必须添加这张图片,以解释:

如何计算30分钟窗口的最后值与第一个值之间的差异,并保留其他列?

英文:

Parquet:
Link to the sample dataset
CSV formatted:
Link to the sample dataset \

Updated dataset: Link to new dataset

The expected result should have the result of difference between last value and first value of the 30 min sample window (From the sample data shown above), that is:

	Item	Value	AgentStatus	OriginalTimestamp	AgentTimeStamp
0	Channel1.Device1.Tag1	847	good	2023-07-28T13:09:00.0098328+09:00	2023-07-28T13:09:00.0000000
1	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:00.0408696+09:00	2023-07-28T13:09:00.0000000
2	Channel1.Device1.Tag1	848	good	2023-07-28T13:09:05.0138770+09:00	2023-07-28T13:09:05.0000000
3	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:05.0454734+09:00	2023-07-28T13:09:05.0000000
4	Channel1.Device1.Tag1	849	good	2023-07-28T13:09:10.0073605+09:00	2023-07-28T13:09:10.0000000
5	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:10.0379516+09:00	2023-07-28T13:09:10.0000000
6	Channel1.Device1.Tag1	850	good	2023-07-28T13:09:15.0074263+09:00	2023-07-28T13:09:15.0000000
7	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:15.0387691+09:00	2023-07-28T13:09:15.0000000
8	Channel1.Device1.Tag1	851	good	2023-07-28T13:09:20.0176840+09:00	2023-07-28T13:09:20.0000000
9	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:20.0329268+09:00	2023-07-28T13:09:20.0000000
10	Channel1.Device1.Tag1	852	good	2023-07-28T13:09:25.0070191+09:00	2023-07-28T13:09:25.0000000
11	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:25.0384699+09:00	2023-07-28T13:09:25.0000000
12	Channel1.Device1.Tag1	853	good	2023-07-28T13:09:30.0109244+09:00	2023-07-28T13:09:30.0000000
13	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:30.0417249+09:00	2023-07-28T13:09:30.0000000
14	Channel1.Device1.Tag1	854	good	2023-07-28T13:09:35.0118763+09:00	2023-07-28T13:09:35.0000000
15	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:35.0429050+09:00	2023-07-28T13:09:35.0000000
16	Channel1.Device1.Tag1	855	good	2023-07-28T13:09:40.0027594+09:00	2023-07-28T13:09:40.0000000
17	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:40.0340476+09:00	2023-07-28T13:09:40.0000000
18	Channel1.Device1.Tag1	856	good	2023-07-28T13:09:45.0029277+09:00	2023-07-28T13:09:45.0000000
19	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:45.0336946+09:00	2023-07-28T13:09:45.0000000
20	Channel1.Device1.Tag1	857	good	2023-07-28T13:09:50.0153041+09:00	2023-07-28T13:09:50.0000000
21	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:50.0459796+09:00	2023-07-28T13:09:50.0000000
22	Channel1.Device1.Tag1	858	good	2023-07-28T13:09:55.0103680+09:00	2023-07-28T13:09:55.0000000
23	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:55.0412343+09:00	2023-07-28T13:09:55.0000000
24	Channel1.Device1.Tag1	859	good	2023-07-28T13:10:00.0095407+09:00	2023-07-28T13:10:00.0000000
25	Channel1.Device1.Tag2	0	good	2023-07-28T13:10:00.0395870+09:00	2023-07-28T13:10:00.0000000
26	Channel1.Device1.Tag1	860	good	2023-07-28T13:10:05.0069727+09:00	2023-07-28T13:10:05.0000000
27	Channel1.Device1.Tag2	0	good	2023-07-28T13:10:05.0374699+09:00	2023-07-28T13:10:05.0000000
28	Channel1.Device1.Tag1	861	good	2023-07-28T13:10:10.0113827+09:00	2023-07-28T13:10:10.0000000
29	Channel1.Device1.Tag2	0	good	2023-07-28T13:10:10.0431140+09:00	2023-07-28T13:10:10.0000000
30	Channel1.Device1.Tag1	862	good	2023-07-28T13:10:15.0024582+09:00	2023-07-28T13:10:15.0000000
31	Channel1.Device1.Tag2	0	good	2023-07-28T13:10:15.0338704+09:00	2023-07-28T13:10:15.0000000
32	Channel1.Device1.Tag2	0	good	2023-07-28T13:11:15.0338704+09:01	2023-07-28T13:11:15.0000000
33	Channel1.Device1.Tag2	0	good	2023-07-28T13:12:15.0338704+09:02	2023-07-28T13:12:15.0000000
34	Channel1.Device1.Tag2	0	good	2023-07-28T13:15:15.0338704+09:03	2023-07-28T13:15:15.0000000
35	Channel1.Device1.Tag2	0	good	2023-07-28T13:20:15.0338704+09:04	2023-07-28T13:20:15.0000000
36	Channel1.Device1.Tag2	0	good	2023-07-28T13:21:15.0338704+09:05	2023-07-28T13:21:15.0000000
37	Channel1.Device1.Tag2	0	good	2023-07-28T13:22:15.0338704+09:06	2023-07-28T13:22:15.0000000
38	Channel1.Device1.Tag2	0	good	2023-07-28T13:25:15.0338704+09:07	2023-07-28T13:25:15.0000000
39	Channel1.Device1.Tag2	878	bad	2023-07-28T13:30:15.0338704+09:08	2023-07-28T13:30:15.0000000
40	Channel1.Device1.Tag2	0	good	2023-07-28T13:31:15.0338704+09:09	2023-07-28T13:31:15.0000000

Expected result/calculation:

After sampling within 30 mins, the result will contain data from these two items as shown:
如何计算30分钟窗口的最后值与第一个值之间的差异,并保留其他列?

  • For each Item, Example: For Channel1.Device1.Tag2:
    Last Value - First Value of the 30 minute windows, i.e:

Last Row Value (878):

39  Channel1.Device1.Tag2   878 bad 2023-07-28T13:30:15.0338704+09:08   2023-07-28T13:30:15.0000000 

First row value (0):

1   Channel1.Device1.Tag2   0   good    2023-07-28T13:09:00.0408696+09:00   2023-07-28T13:09:00.0000000

Expected result/ dataframe:

如何计算30分钟窗口的最后值与第一个值之间的差异,并保留其他列?

As seen, the Value 878 and 3 corresponds to Channel1.Device1.Tag2 and Channel1.Device1.Tag1, taking the difference between (last value - first value): (878 - 0) and (850-847), whereas other column values are retained.

Update:

The new dataset was run through the provided solution from mozway, and the results has a slight deviation (Instead of taking diff b/w 00:30:00 - 00:00:00) because of the duplicate value of timestamp against each item tag name. How to resolve this ? Or How to skip that duplicate value from the row.

I have to add this image, to explain:
如何计算30分钟窗口的最后值与第一个值之间的差异,并保留其他列?

答案1

得分: 1

以下是已经翻译好的代码部分:

r = df.groupby('Item').resample('30T', on='AgentTimeStamp', origin='start', label='right')

out = (r.last()
        .assign(Value=lambda d: d[['Value']].sub(r[['Value']].first()))
        .reset_index('AgentTimeStamp')[df.columns]
      )

Output:

                       index                   Item  Value AgentStatus      AgentTimeStamp
Item                                                                                      
Channel1.Device1.Tag1      2  Channel1.Device1.Tag1      1        good 2023-07-28 13:39:00
Channel1.Device1.Tag2     39  Channel1.Device1.Tag2    878         bad 2023-07-28 13:39:00
英文:

What you want is unclear, maybe you can combine groupby and resample:

r = df.groupby('Item').resample('30T', on='AgentTimeStamp', origin='start', label='right')

out = (r.last()
        .assign(Value=lambda d: d[['Value']].sub(r[['Value']].first()))
        .reset_index('AgentTimeStamp')[df.columns]
      )

Output:

                       index                   Item  Value AgentStatus      AgentTimeStamp
Item                                                                                      
Channel1.Device1.Tag1      2  Channel1.Device1.Tag1      1        good 2023-07-28 13:39:00
Channel1.Device1.Tag2     39  Channel1.Device1.Tag2    878         bad 2023-07-28 13:39:00

答案2

得分: 0

以下是翻译好的部分:

假设您希望每个项目的观察都有30分钟的时间差您可以使用自连接来实现这个函数适用于我的情况所以我希望它能帮助您

def get_lagged_df(df: pd.DataFrame,
                  id_col: str ='id',
                  time_col: str = 'time',
                  val_col: str = 'val',
                  display_na_rows: bool = True) -> pd.DataFrame:

# 基于ID进行自连接
joined_df = pd.merge(df, df, on = id_col, suffixes = ('_end', '_start'))

# 仅保留结束时间在开始时间之上的行
joined_df = joined_df[joined_df[f'{time_col}_end'] > joined_df[f'{time_col}_start']]

# 用于获取结束时间和开始时间之间的差异(以分钟为单位)的Lambda函数
lambda_func = lambda row: int((pd.to_datetime(row[f'{time_col}_end']) - pd.to_datetime(row[f'{time_col}_start']))
                              .seconds / 60)

# 获取结束时间和开始时间之间的时间差(以分钟为单位)
joined_df[f'{time_col}_difference'] = joined_df.apply(lambda_func, axis = 1)

# 仅保留时间差为30分钟的行 -> 30分钟窗口
joined_df = joined_df[joined_df[f'{time_col}_difference'] == 30]

# 获取30分钟窗口的第一个观察和最后一个观察之间的差异
joined_df[f'{val_col}_difference'] = joined_df[f'{val_col}_end'] - joined_df[f'{val_col}_start']

# 添加N/A行(即,没有任何开始时间的行,因此也没有时间和值差异)
if display_na_rows:
    df_w_na_rows = df.rename(columns = {f'{val_col}': f'{val_col}_end', f'{time_col}':f'{time_col}_end'})

    joined_df = pd.merge(df_w_na_rows,
                         joined_df.drop(f'{val_col}_end', axis = 1),
                         on = [id_col, f'{time_col}_end'],
                         how = 'left',
                        )

# 调整列和行的顺序
joined_df = (joined_df[[id_col, f'{time_col}_start', f'{time_col}_end', f'{time_col}_difference',
                        f'{val_col}_start', f'{val_col}_end', f'{val_col}_difference']]
             .sort_values(by=[id_col, f'{time_col}_end'])
             .reset_index(drop = True)
            )

return joined_df

假设您的数据格式如下:

id val time
1 0.638129 2023-01-01 00:00:00
1 0.047401 2023-01-01 00:01:00
1 0.781685 2023-01-01 00:02:00
1 0.839184 2023-01-01 00:03:00
1 0.526894 2023-01-01 00:04:00
.. ........ ..................
3 0.370669 2023-01-01 00:56:00
3 0.565692 2023-01-01 00:57:00
3 0.121998 2023-01-01 00:58:00
3 0.872352 2023-01-01 00:59:00
3 0.624171 2023-01-02 01:00:00

其中ID 1在时间戳00:00和00:30的值如下:

id val time
1 0.638129 2023-01-01 00:00:00
1 0.184684 2023-01-01 00:30:00

如果在不显示具有缺失值的行(即,没有开始时间的行,因此也没有差异)的情况下应用该函数,我们将得到每个ID观察的时间$t$和时间$t-30$之间值的差异:

get_lagged_df(df, display_na_rows = False)

如果想要保留没有开始时间的行,只需将display_na_rows参数更改为True

get_lagged_df(df, display_na_rows = True)
英文:

Assuming that you want a 30-minutes window difference for each item's observation, you can achieve it using self join. This function works for my case, so I hope it'll help:

def get_lagged_df(df: pd.DataFrame,
id_col: str ='id',
time_col: str = 'time',
val_col: str = 'val',
display_na_rows: bool = True) -> pd.DataFrame:
# Self join based on the ID
joined_df = pd.merge(df, df, on = id_col, suffixes = ('_end', '_start'))
# Keep only the rows where end time is above the start time
joined_df = joined_df[joined_df[f'{time_col}_end'] > joined_df[f'{time_col}_start']]
# Lambda function for getting differences between end time and start time in minutes
lambda_func = lambda row: int((pd.to_datetime(row[f'{time_col}_end']) - pd.to_datetime(row[f'{time_col}_start']))
.seconds / 60)
# Get the time differences in minutes between end time and start time
joined_df[f'{time_col}_difference'] = joined_df.apply(lambda_func, axis = 1)
# Kepp only the rows where the time difference is 30 minutes -> 30 min. window
joined_df = joined_df[joined_df[f'{time_col}_difference'] == 30]
# Get the difference between the first observation and last observation of 30 min. window
joined_df[f'{val_col}_difference'] = joined_df[f'{val_col}_end'] - joined_df[f'{val_col}_start']
# Appending N/A rows (i.e., rows which don't have any start times, hence no time and val differences)
if display_na_rows:
df_w_na_rows = df.rename(columns = {f'{val_col}': f'{val_col}_end', f'{time_col}':f'{time_col}_end'})
joined_df = pd.merge(df_w_na_rows,
joined_df.drop(f'{val_col}_end', axis = 1),
on = [id_col, f'{time_col}_end'],
how = 'left',
)
# Adjusting columns' and rows' orders
joined_df = (joined_df[[id_col, f'{time_col}_start', f'{time_col}_end', f'{time_col}_difference',
f'{val_col}_start', f'{val_col}_end', f'{val_col}_difference']]
.sort_values(by=[id_col, f'{time_col}_end'])
.reset_index(drop = True)
)
return joined_df

Assume that you have this data in the following format:

id val time
1 0.638129 2023-01-01 00:00:00
1 0.047401 2023-01-01 00:01:00
1 0.781685 2023-01-01 00:02:00
1 0.839184 2023-01-01 00:03:00
1 0.526894 2023-01-01 00:04:00
.. ........ ..................
3 0.370669 2023-01-01 00:56:00
3 0.565692 2023-01-01 00:57:00
3 0.121998 2023-01-01 00:58:00
3 0.872352 2023-01-01 00:59:00
3 0.624171 2023-01-02 01:00:00

Where values for ID 1 in timestamps 00:00 and 00:30 are as follows:

id val time
1 0.638129 2023-01-01 00:00:00
1 0.184684 2023-01-01 00:30:00

When applying the function without displaying the rows with missing values (i.e., rows with no start times, hence no differences as well), we get difference between the value at time $t$ and the value at time $t-30$ for each ID's observation:

get_lagged_df(df, display_na_rows = False)
id time_start time_end time_difference val_start val_end val_difference
1 2023-01-01 00:00:00 2023-01-01 00:30:00 30 0.638129 0.184684 -0.453445
1 2023-01-01 00:01:00 2023-01-01 00:31:00 30 0.047401 0.613936 0.566535
1 2023-01-01 00:02:00 2023-01-01 00:32:00 30 0.781685 0.865400 0.083714
1 2023-01-01 00:03:00 2023-01-01 00:33:00 30 0.839184 0.089480 -0.749704
1 2023-01-01 00:04:00 2023-01-01 00:34:00 30 0.526894 0.541837 0.014943
.. ... ... ... ... ... ...
3 2023-01-01 00:26:00 2023-01-01 00:56:00 30 0.812442 0.370669 -0.441773
3 2023-01-01 00:27:00 2023-01-01 00:57:00 30 0.352933 0.565692 0.212759
3 2023-01-01 00:28:00 2023-01-01 00:58:00 30 0.186111 0.121998 -0.064113
3 2023-01-01 00:29:00 2023-01-01 00:59:00 30 0.671967 0.872352 0.200385
3 2023-01-01 00:30:00 2023-01-02 01:00:00 30 0.008082 0.624171 0.616090

If we want to keep the rows with no start times as well, just change the display_na_rows argument to True:

get_lagged_df(df, display_na_rows = True)
id time_start time_end time_difference val_start val_end val_difference
1 NaT 2023-01-01 00:00:00 NaN NaN 0.638129 NaN
1 NaT 2023-01-01 00:01:00 NaN NaN 0.047401 NaN
1 NaT 2023-01-01 00:02:00 NaN NaN 0.781685 NaN
1 NaT 2023-01-01 00:03:00 NaN NaN 0.839184 NaN
1 NaT 2023-01-01 00:04:00 NaN NaN 0.526894 NaN
.. ... ... ... ... ... ...
3 2023-01-01 00:26:00 2023-01-01 00:56:00 30.0 0.812442 0.370669 -0.441773
3 2023-01-01 00:27:00 2023-01-01 00:57:00 30.0 0.352933 0.565692 0.212759
3 2023-01-01 00:28:00 2023-01-01 00:58:00 30.0 0.186111 0.121998 -0.064113
3 2023-01-01 00:29:00 2023-01-01 00:59:00 30.0 0.671967 0.872352 0.200385
3 2023-01-01 00:30:00 2023-01-02 01:00:00 30.0 0.008082 0.624171 0.616090

huangapple
  • 本文由 发表于 2023年7月31日 19:13:25
  • 转载请务必保留本文链接:https://go.coder-hub.com/76803053.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定