How to take difference between last value and first value of 30 min window, and keep other columns as well?
Question
Parquet:
Link to the sample dataset
CSV formatted:
Link to the sample dataset
Updated dataset: Link to new dataset
The expected result should contain the difference between the last value and the first value of the 30-minute sample window (from the sample data shown below), that is:
Item Value AgentStatus OriginalTimestamp AgentTimeStamp
0 Channel1.Device1.Tag1 847 good 2023-07-28T13:09:00.0098328+09:00 2023-07-28T13:09:00.0000000
1 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:00.0408696+09:00 2023-07-28T13:09:00.0000000
2 Channel1.Device1.Tag1 848 good 2023-07-28T13:09:05.0138770+09:00 2023-07-28T13:09:05.0000000
3 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:05.0454734+09:00 2023-07-28T13:09:05.0000000
4 Channel1.Device1.Tag1 849 good 2023-07-28T13:09:10.0073605+09:00 2023-07-28T13:09:10.0000000
5 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:10.0379516+09:00 2023-07-28T13:09:10.0000000
6 Channel1.Device1.Tag1 850 good 2023-07-28T13:09:15.0074263+09:00 2023-07-28T13:09:15.0000000
7 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:15.0387691+09:00 2023-07-28T13:09:15.0000000
8 Channel1.Device1.Tag1 851 good 2023-07-28T13:09:20.0176840+09:00 2023-07-28T13:09:20.0000000
9 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:20.0329268+09:00 2023-07-28T13:09:20.0000000
10 Channel1.Device1.Tag1 852 good 2023-07-28T13:09:25.0070191+09:00 2023-07-28T13:09:25.0000000
11 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:25.0384699+09:00 2023-07-28T13:09:25.0000000
12 Channel1.Device1.Tag1 853 good 2023-07-28T13:09:30.0109244+09:00 2023-07-28T13:09:30.0000000
13 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:30.0417249+09:00 2023-07-28T13:09:30.0000000
14 Channel1.Device1.Tag1 854 good 2023-07-28T13:09:35.0118763+09:00 2023-07-28T13:09:35.0000000
15 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:35.0429050+09:00 2023-07-28T13:09:35.0000000
16 Channel1.Device1.Tag1 855 good 2023-07-28T13:09:40.0027594+09:00 2023-07-28T13:09:40.0000000
17 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:40.0340476+09:00 2023-07-28T13:09:40.0000000
18 Channel1.Device1.Tag1 856 good 2023-07-28T13:09:45.0029277+09:00 2023-07-28T13:09:45.0000000
19 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:45.0336946+09:00 2023-07-28T13:09:45.0000000
20 Channel1.Device1.Tag1 857 good 2023-07-28T13:09:50.0153041+09:00 2023-07-28T13:09:50.0000000
21 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:50.0459796+09:00 2023-07-28T13:09:50.0000000
22 Channel1.Device1.Tag1 858 good 2023-07-28T13:09:55.0103680+09:00 2023-07-28T13:09:55.0000000
23 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:55.0412343+09:00 2023-07-28T13:09:55.0000000
24 Channel1.Device1.Tag1 859 good 2023-07-28T13:10:00.0095407+09:00 2023-07-28T13:10:00.0000000
25 Channel1.Device1.Tag2 0 good 2023-07-28T13:10:00.0395870+09:00 2023-07-28T13:10:00.0000000
26 Channel1.Device1.Tag1 860 good 2023-07-28T13:10:05.0069727+09:00 2023-07-28T13:10:05.0000000
27 Channel1.Device1.Tag2 0 good 2023-07-28T13:10:05.0374699+09:00 2023-07-28T13:10:05.0000000
28 Channel1.Device1.Tag1 861 good 2023-07-28T13:10:10.0113827+09:00 2023-07-28T13:10:10.0000000
29 Channel1.Device1.Tag2 0 good 2023-07-28T13:10:10.0431140+09:00 2023-07-28T13:10:10.0000000
30 Channel1.Device1.Tag1 862 good 2023-07-28T13:10:15.0024582+09:00 2023-07-28T13:10:15.0000000
31 Channel1.Device1.Tag2 0 good 2023-07-28T13:10:15.0338704+09:00 2023-07-28T13:10:15.0000000
32 Channel1.Device1.Tag2 0 good 2023-07-28T13:11:15.0338704+09:01 2023-07-28T13:11:15.0000000
33 Channel1.Device1.Tag2 0 good 2023-07-28T13:12:15.0338704+09:02 2023-07-28T13:12:15.0000000
34 Channel1.Device1.Tag2 0 good 2023-07-28T13:15:15.0338704+09:03 2023-07-28T13:15:15.0000000
35 Channel1.Device1.Tag2 0 good 2023-07-28T13:20:15.0338704+09:04 2023-07-28T13:20:15.0000000
36 Channel1.Device1.Tag2 0 good 2023-07-28T13:21:15.0338704+09:05 2023-07-28T13:21:15.0000000
37 Channel1.Device1.Tag2 0 good 2023-07-28T13:22:15.0338704+09:06 2023-07-28T13:22:15.0000000
38 Channel1.Device1.Tag2 0 good 2023-07-28T13:25:15.0338704+09:07 2023-07-28T13:25:15.0000000
39 Channel1.Device1.Tag2 878 bad 2023-07-28T13:30:15.0338704+09:08 2023-07-28T13:30:15.0000000
40 Channel1.Device1.Tag2 0 good 2023-07-28T13:31:15.0338704+09:09 2023-07-28T13:31:15.0000000
Expected result/calculation:
After sampling within 30 mins, the result will contain data from these two items as shown:
- For each Item, e.g. for Channel1.Device1.Tag2:
Last Value - First Value of the 30-minute window, i.e.:
Last Row Value (878):
39 Channel1.Device1.Tag2 878 bad 2023-07-28T13:30:15.0338704+09:08 2023-07-28T13:30:15.0000000
First row value (0):
1 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:00.0408696+09:00 2023-07-28T13:09:00.0000000
Expected result/ dataframe:
As seen, the values 878 and 3 correspond to Channel1.Device1.Tag2 and Channel1.Device1.Tag1 respectively, taking the difference (last value - first value): (878 - 0) and (850 - 847), whereas the other column values are retained.
Update:
The new dataset was run through the solution provided by mozway, and the results show a slight deviation (instead of taking the diff between 00:30:00 and 00:00:00) because of duplicate timestamp values against each item tag name. How can this be resolved? Or how can that duplicate value be skipped in the row?
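For reference, a minimal sketch of the de-duplication step being asked about, assuming pandas and the Item/AgentTimeStamp columns shown above:

import pandas as pd

# Keep only the first row per (Item, AgentTimeStamp) pair so that repeated
# timestamps no longer distort the 30-minute windows.
# (A sketch: assumes df holds the sample data with AgentTimeStamp parsed as datetime.)
df = df.drop_duplicates(subset=['Item', 'AgentTimeStamp'], keep='first')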
Answer 1
Score: 1
What you want is unclear; maybe you can combine groupby and resample:
# Resample each Item into 30-minute windows anchored at its first timestamp,
# labelling each window by its right edge
r = df.groupby('Item').resample('30T', on='AgentTimeStamp', origin='start', label='right')

# Take the last row of each window, replace Value with (last - first),
# and restore the original column order
out = (r.last()
       .assign(Value=lambda d: d[['Value']].sub(r[['Value']].first()))
       .reset_index('AgentTimeStamp')[df.columns]
      )
Output:
index Item Value AgentStatus AgentTimeStamp
Item
Channel1.Device1.Tag1 2 Channel1.Device1.Tag1 1 good 2023-07-28 13:39:00
Channel1.Device1.Tag2 39 Channel1.Device1.Tag2 878 bad 2023-07-28 13:39:00
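This assumes AgentTimeStamp is already a datetime column; if reading from the linked CSV, a minimal setup sketch (the file name is a placeholder for the linked dataset) would be:

import pandas as pd

# Placeholder file name; the question links to the actual dataset.
df = pd.read_csv('sample.csv')
df['AgentTimeStamp'] = pd.to_datetime(df['AgentTimeStamp'])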
Answer 2
Score: 0
Assuming that you want a 30-minute window difference for each item's observation, you can achieve it using a self join. This function works for my case, so I hope it helps:
import pandas as pd

def get_lagged_df(df: pd.DataFrame,
                  id_col: str = 'id',
                  time_col: str = 'time',
                  val_col: str = 'val',
                  display_na_rows: bool = True) -> pd.DataFrame:
    # Self join based on the ID
    joined_df = pd.merge(df, df, on=id_col, suffixes=('_end', '_start'))
    # Keep only the rows where the end time is after the start time
    joined_df = joined_df[joined_df[f'{time_col}_end'] > joined_df[f'{time_col}_start']]
    # Lambda function for getting the difference between end time and start time
    # in minutes (note: .seconds ignores the day component; .total_seconds()
    # would be safer if a window could span more than one day)
    lambda_func = lambda row: int((pd.to_datetime(row[f'{time_col}_end'])
                                   - pd.to_datetime(row[f'{time_col}_start'])).seconds / 60)
    # Get the time differences in minutes between end time and start time
    joined_df[f'{time_col}_difference'] = joined_df.apply(lambda_func, axis=1)
    # Keep only the rows where the time difference is 30 minutes -> 30 min. window
    joined_df = joined_df[joined_df[f'{time_col}_difference'] == 30]
    # Get the difference between the first and the last observation of the 30 min. window
    joined_df[f'{val_col}_difference'] = joined_df[f'{val_col}_end'] - joined_df[f'{val_col}_start']
    # Append N/A rows (i.e., rows without any start time, hence no time and val differences)
    if display_na_rows:
        df_w_na_rows = df.rename(columns={f'{val_col}': f'{val_col}_end',
                                          f'{time_col}': f'{time_col}_end'})
        joined_df = pd.merge(df_w_na_rows,
                             joined_df.drop(f'{val_col}_end', axis=1),
                             on=[id_col, f'{time_col}_end'],
                             how='left',
                             )
    # Adjust the column and row order
    joined_df = (joined_df[[id_col, f'{time_col}_start', f'{time_col}_end', f'{time_col}_difference',
                            f'{val_col}_start', f'{val_col}_end', f'{val_col}_difference']]
                 .sort_values(by=[id_col, f'{time_col}_end'])
                 .reset_index(drop=True)
                 )
    return joined_df
Assume that you have this data in the following format:
id | val | time |
---|---|---|
1 | 0.638129 | 2023-01-01 00:00:00 |
1 | 0.047401 | 2023-01-01 00:01:00 |
1 | 0.781685 | 2023-01-01 00:02:00 |
1 | 0.839184 | 2023-01-01 00:03:00 |
1 | 0.526894 | 2023-01-01 00:04:00 |
.. | ........ | .................. |
3 | 0.370669 | 2023-01-01 00:56:00 |
3 | 0.565692 | 2023-01-01 00:57:00 |
3 | 0.121998 | 2023-01-01 00:58:00 |
3 | 0.872352 | 2023-01-01 00:59:00 |
3 | 0.624171 | 2023-01-02 01:00:00 |
where the values for ID 1 at timestamps 00:00 and 00:30 are as follows:
id | val | time |
---|---|---|
1 | 0.638129 | 2023-01-01 00:00:00 |
1 | 0.184684 | 2023-01-01 00:30:00 |
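For reference, a frame of this shape could be mocked up as follows (a sketch; the values above came from random draws, so exact numbers will differ):

import numpy as np
import pandas as pd

# Hypothetical reconstruction of the sample data: three IDs with one random
# value per minute.
times = pd.date_range('2023-01-01 00:00:00', periods=61, freq='min')
df = pd.concat(
    [pd.DataFrame({'id': i, 'val': np.random.rand(len(times)), 'time': times})
     for i in (1, 2, 3)],
    ignore_index=True,
)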
When applying the function without displaying the rows with missing values (i.e., rows with no start times, hence no differences either), we get the difference between the value at time $t$ and the value at time $t-30$ for each ID's observation:
get_lagged_df(df, display_na_rows = False)
id | time_start | time_end | time_difference | val_start | val_end | val_difference |
---|---|---|---|---|---|---|
1 | 2023-01-01 00:00:00 | 2023-01-01 00:30:00 | 30 | 0.638129 | 0.184684 | -0.453445 |
1 | 2023-01-01 00:01:00 | 2023-01-01 00:31:00 | 30 | 0.047401 | 0.613936 | 0.566535 |
1 | 2023-01-01 00:02:00 | 2023-01-01 00:32:00 | 30 | 0.781685 | 0.865400 | 0.083714 |
1 | 2023-01-01 00:03:00 | 2023-01-01 00:33:00 | 30 | 0.839184 | 0.089480 | -0.749704 |
1 | 2023-01-01 00:04:00 | 2023-01-01 00:34:00 | 30 | 0.526894 | 0.541837 | 0.014943 |
.. | ... | ... | ... | ... | ... | ... |
3 | 2023-01-01 00:26:00 | 2023-01-01 00:56:00 | 30 | 0.812442 | 0.370669 | -0.441773 |
3 | 2023-01-01 00:27:00 | 2023-01-01 00:57:00 | 30 | 0.352933 | 0.565692 | 0.212759 |
3 | 2023-01-01 00:28:00 | 2023-01-01 00:58:00 | 30 | 0.186111 | 0.121998 | -0.064113 |
3 | 2023-01-01 00:29:00 | 2023-01-01 00:59:00 | 30 | 0.671967 | 0.872352 | 0.200385 |
3 | 2023-01-01 00:30:00 | 2023-01-02 01:00:00 | 30 | 0.008082 | 0.624171 | 0.616090 |
If we want to keep the rows with no start times as well, just change the display_na_rows argument to True:
get_lagged_df(df, display_na_rows = True)
id | time_start | time_end | time_difference | val_start | val_end | val_difference |
---|---|---|---|---|---|---|
1 | NaT | 2023-01-01 00:00:00 | NaN | NaN | 0.638129 | NaN |
1 | NaT | 2023-01-01 00:01:00 | NaN | NaN | 0.047401 | NaN |
1 | NaT | 2023-01-01 00:02:00 | NaN | NaN | 0.781685 | NaN |
1 | NaT | 2023-01-01 00:03:00 | NaN | NaN | 0.839184 | NaN |
1 | NaT | 2023-01-01 00:04:00 | NaN | NaN | 0.526894 | NaN |
.. | ... | ... | ... | ... | ... | ... |
3 | 2023-01-01 00:26:00 | 2023-01-01 00:56:00 | 30.0 | 0.812442 | 0.370669 | -0.441773 |
3 | 2023-01-01 00:27:00 | 2023-01-01 00:57:00 | 30.0 | 0.352933 | 0.565692 | 0.212759 |
3 | 2023-01-01 00:28:00 | 2023-01-01 00:58:00 | 30.0 | 0.186111 | 0.121998 | -0.064113 |
3 | 2023-01-01 00:29:00 | 2023-01-01 00:59:00 | 30.0 | 0.671967 | 0.872352 | 0.200385 |
3 | 2023-01-01 00:30:00 | 2023-01-02 01:00:00 | 30.0 | 0.008082 | 0.624171 | 0.616090 |
Comments