How to take the difference between the last value and the first value of a 30 min window, and keep the other columns as well?

Question
Parquet:
Link to the sample dataset 
CSV formatted:
Link to the sample dataset
Updated dataset: Link to new dataset
The expected result should contain the difference between the last value and the first value of the 30 min sample window (from the sample data shown above), that is:
	Item	Value	AgentStatus	OriginalTimestamp	AgentTimeStamp
0	Channel1.Device1.Tag1	847	good	2023-07-28T13:09:00.0098328+09:00	2023-07-28T13:09:00.0000000
1	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:00.0408696+09:00	2023-07-28T13:09:00.0000000
2	Channel1.Device1.Tag1	848	good	2023-07-28T13:09:05.0138770+09:00	2023-07-28T13:09:05.0000000
3	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:05.0454734+09:00	2023-07-28T13:09:05.0000000
4	Channel1.Device1.Tag1	849	good	2023-07-28T13:09:10.0073605+09:00	2023-07-28T13:09:10.0000000
5	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:10.0379516+09:00	2023-07-28T13:09:10.0000000
6	Channel1.Device1.Tag1	850	good	2023-07-28T13:09:15.0074263+09:00	2023-07-28T13:09:15.0000000
7	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:15.0387691+09:00	2023-07-28T13:09:15.0000000
8	Channel1.Device1.Tag1	851	good	2023-07-28T13:09:20.0176840+09:00	2023-07-28T13:09:20.0000000
9	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:20.0329268+09:00	2023-07-28T13:09:20.0000000
10	Channel1.Device1.Tag1	852	good	2023-07-28T13:09:25.0070191+09:00	2023-07-28T13:09:25.0000000
11	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:25.0384699+09:00	2023-07-28T13:09:25.0000000
12	Channel1.Device1.Tag1	853	good	2023-07-28T13:09:30.0109244+09:00	2023-07-28T13:09:30.0000000
13	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:30.0417249+09:00	2023-07-28T13:09:30.0000000
14	Channel1.Device1.Tag1	854	good	2023-07-28T13:09:35.0118763+09:00	2023-07-28T13:09:35.0000000
15	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:35.0429050+09:00	2023-07-28T13:09:35.0000000
16	Channel1.Device1.Tag1	855	good	2023-07-28T13:09:40.0027594+09:00	2023-07-28T13:09:40.0000000
17	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:40.0340476+09:00	2023-07-28T13:09:40.0000000
18	Channel1.Device1.Tag1	856	good	2023-07-28T13:09:45.0029277+09:00	2023-07-28T13:09:45.0000000
19	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:45.0336946+09:00	2023-07-28T13:09:45.0000000
20	Channel1.Device1.Tag1	857	good	2023-07-28T13:09:50.0153041+09:00	2023-07-28T13:09:50.0000000
21	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:50.0459796+09:00	2023-07-28T13:09:50.0000000
22	Channel1.Device1.Tag1	858	good	2023-07-28T13:09:55.0103680+09:00	2023-07-28T13:09:55.0000000
23	Channel1.Device1.Tag2	0	good	2023-07-28T13:09:55.0412343+09:00	2023-07-28T13:09:55.0000000
24	Channel1.Device1.Tag1	859	good	2023-07-28T13:10:00.0095407+09:00	2023-07-28T13:10:00.0000000
25	Channel1.Device1.Tag2	0	good	2023-07-28T13:10:00.0395870+09:00	2023-07-28T13:10:00.0000000
26	Channel1.Device1.Tag1	860	good	2023-07-28T13:10:05.0069727+09:00	2023-07-28T13:10:05.0000000
27	Channel1.Device1.Tag2	0	good	2023-07-28T13:10:05.0374699+09:00	2023-07-28T13:10:05.0000000
28	Channel1.Device1.Tag1	861	good	2023-07-28T13:10:10.0113827+09:00	2023-07-28T13:10:10.0000000
29	Channel1.Device1.Tag2	0	good	2023-07-28T13:10:10.0431140+09:00	2023-07-28T13:10:10.0000000
30	Channel1.Device1.Tag1	862	good	2023-07-28T13:10:15.0024582+09:00	2023-07-28T13:10:15.0000000
31	Channel1.Device1.Tag2	0	good	2023-07-28T13:10:15.0338704+09:00	2023-07-28T13:10:15.0000000
32	Channel1.Device1.Tag2	0	good	2023-07-28T13:11:15.0338704+09:01	2023-07-28T13:11:15.0000000
33	Channel1.Device1.Tag2	0	good	2023-07-28T13:12:15.0338704+09:02	2023-07-28T13:12:15.0000000
34	Channel1.Device1.Tag2	0	good	2023-07-28T13:15:15.0338704+09:03	2023-07-28T13:15:15.0000000
35	Channel1.Device1.Tag2	0	good	2023-07-28T13:20:15.0338704+09:04	2023-07-28T13:20:15.0000000
36	Channel1.Device1.Tag2	0	good	2023-07-28T13:21:15.0338704+09:05	2023-07-28T13:21:15.0000000
37	Channel1.Device1.Tag2	0	good	2023-07-28T13:22:15.0338704+09:06	2023-07-28T13:22:15.0000000
38	Channel1.Device1.Tag2	0	good	2023-07-28T13:25:15.0338704+09:07	2023-07-28T13:25:15.0000000
39	Channel1.Device1.Tag2	878	bad	2023-07-28T13:30:15.0338704+09:08	2023-07-28T13:30:15.0000000
40	Channel1.Device1.Tag2	0	good	2023-07-28T13:31:15.0338704+09:09	2023-07-28T13:31:15.0000000
Expected result/calculation:
After resampling into 30-minute windows, the result will contain data from these two items as shown:

- For each Item, for example Channel1.Device1.Tag2:
last value - first value of the 30-minute window, i.e.:
Last row value (878):
39  Channel1.Device1.Tag2   878 bad 2023-07-28T13:30:15.0338704+09:08   2023-07-28T13:30:15.0000000 
First row value (0):
1   Channel1.Device1.Tag2   0   good    2023-07-28T13:09:00.0408696+09:00   2023-07-28T13:09:00.0000000
Expected result / dataframe:
As seen, the values 878 and 3 correspond to Channel1.Device1.Tag2 and Channel1.Device1.Tag1, taking the difference (last value - first value): (878 - 0) and (850 - 847), while the other column values are retained.
Update:
The new dataset was run through the solution provided by mozway, and the results show a slight deviation (instead of taking the diff between 00:30:00 and 00:00:00) because of duplicate timestamp values for each item tag name. How can this be resolved, or how can the duplicate value in the row be skipped?
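For reference, one possible way to skip the duplicated timestamps is to de-duplicate each item's rows before resampling. This is only a sketch, not part of the original post: it assumes pandas, that `df` is the frame loaded from the dataset above, and that keeping the first occurrence of each duplicate is acceptable:

```python
import pandas as pd

# Assumption: df holds the columns shown above (Item, Value, AgentStatus, ...).
df['AgentTimeStamp'] = pd.to_datetime(df['AgentTimeStamp'])

# Keep only the first row per (Item, AgentTimeStamp) pair, so repeated
# timestamps within an item no longer shift the resampling windows.
deduped = df.drop_duplicates(subset=['Item', 'AgentTimeStamp'], keep='first')
```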
Answer 1 (score: 1)
What you want is unclear; maybe you can combine groupby and resample:
```python
import pandas as pd

# Group per item, then resample into 30-minute windows anchored at the start
# of the data, labelling every window by its right edge.
r = df.groupby('Item').resample('30T', on='AgentTimeStamp', origin='start', label='right')

# Keep the last row of each window, then overwrite Value with last - first.
out = (r.last()
        .assign(Value=lambda d: d[['Value']].sub(r[['Value']].first()))
        .reset_index('AgentTimeStamp')[df.columns]
      )
```
Output:

```
                       index                   Item  Value AgentStatus      AgentTimeStamp
Item
Channel1.Device1.Tag1      2  Channel1.Device1.Tag1      1        good 2023-07-28 13:39:00
Channel1.Device1.Tag2     39  Channel1.Device1.Tag2    878         bad 2023-07-28 13:39:00
```
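For completeness, a minimal setup sketch before running the snippet above; the file name is hypothetical and it assumes the CSV layout shown in the question:

```python
import pandas as pd

df = pd.read_csv('sample.csv')  # hypothetical file name; use the linked dataset

# resample(..., on='AgentTimeStamp') needs a real datetime column
df['AgentTimeStamp'] = pd.to_datetime(df['AgentTimeStamp'])
```

On recent pandas versions the 'T' offset alias is deprecated in favour of 'min', so '30min' is the drop-in replacement for '30T'.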
Answer 2 (score: 0)
Assuming that you want a 30-minute window difference for each item's observation, you can achieve it with a self-join. This function works for my case, so I hope it helps:
```python
import pandas as pd

def get_lagged_df(df: pd.DataFrame,
                  id_col: str = 'id',
                  time_col: str = 'time',
                  val_col: str = 'val',
                  display_na_rows: bool = True) -> pd.DataFrame:
    # Self join based on the ID
    joined_df = pd.merge(df, df, on=id_col, suffixes=('_end', '_start'))
    # Keep only the rows where end time is above the start time
    joined_df = joined_df[joined_df[f'{time_col}_end'] > joined_df[f'{time_col}_start']]
    # Lambda function for getting differences between end time and start time in minutes
    # (note: .seconds only looks at the sub-day component of the timedelta)
    lambda_func = lambda row: int((pd.to_datetime(row[f'{time_col}_end']) - pd.to_datetime(row[f'{time_col}_start']))
                                  .seconds / 60)
    # Get the time differences in minutes between end time and start time
    joined_df[f'{time_col}_difference'] = joined_df.apply(lambda_func, axis=1)
    # Keep only the rows where the time difference is 30 minutes -> 30 min. window
    joined_df = joined_df[joined_df[f'{time_col}_difference'] == 30]
    # Get the difference between the first observation and last observation of the 30 min. window
    joined_df[f'{val_col}_difference'] = joined_df[f'{val_col}_end'] - joined_df[f'{val_col}_start']
    # Appending N/A rows (i.e., rows which don't have any start times, hence no time and val differences)
    if display_na_rows:
        df_w_na_rows = df.rename(columns={f'{val_col}': f'{val_col}_end', f'{time_col}': f'{time_col}_end'})
        joined_df = pd.merge(df_w_na_rows,
                             joined_df.drop(f'{val_col}_end', axis=1),
                             on=[id_col, f'{time_col}_end'],
                             how='left',
                             )
    # Adjusting columns' and rows' orders
    joined_df = (joined_df[[id_col, f'{time_col}_start', f'{time_col}_end', f'{time_col}_difference',
                            f'{val_col}_start', f'{val_col}_end', f'{val_col}_difference']]
                 .sort_values(by=[id_col, f'{time_col}_end'])
                 .reset_index(drop=True)
                 )
    return joined_df
```
Assume that you have this data in the following format:
| id | val | time | 
|---|---|---|
| 1 | 0.638129 | 2023-01-01 00:00:00 | 
| 1 | 0.047401 | 2023-01-01 00:01:00 | 
| 1 | 0.781685 | 2023-01-01 00:02:00 | 
| 1 | 0.839184 | 2023-01-01 00:03:00 | 
| 1 | 0.526894 | 2023-01-01 00:04:00 | 
| .. | ........ | .................. | 
| 3 | 0.370669 | 2023-01-01 00:56:00 | 
| 3 | 0.565692 | 2023-01-01 00:57:00 | 
| 3 | 0.121998 | 2023-01-01 00:58:00 | 
| 3 | 0.872352 | 2023-01-01 00:59:00 | 
| 3 | 0.624171 | 2023-01-02 01:00:00 | 
Where values for ID 1 in timestamps 00:00 and 00:30 are as follows:
| id | val | time | 
|---|---|---|
| 1 | 0.638129 | 2023-01-01 00:00:00 | 
| 1 | 0.184684 | 2023-01-01 00:30:00 | 
When applying the function without displaying the rows with missing values (i.e., rows with no start times, hence no differences either), we get the difference between the value at time $t$ and the value at time $t-30$ for each ID's observation:

```python
get_lagged_df(df, display_na_rows=False)
```
| id | time_start | time_end | time_difference | val_start | val_end | val_difference | 
|---|---|---|---|---|---|---|
| 1 | 2023-01-01 00:00:00 | 2023-01-01 00:30:00 | 30 | 0.638129 | 0.184684 | -0.453445 | 
| 1 | 2023-01-01 00:01:00 | 2023-01-01 00:31:00 | 30 | 0.047401 | 0.613936 | 0.566535 | 
| 1 | 2023-01-01 00:02:00 | 2023-01-01 00:32:00 | 30 | 0.781685 | 0.865400 | 0.083714 | 
| 1 | 2023-01-01 00:03:00 | 2023-01-01 00:33:00 | 30 | 0.839184 | 0.089480 | -0.749704 | 
| 1 | 2023-01-01 00:04:00 | 2023-01-01 00:34:00 | 30 | 0.526894 | 0.541837 | 0.014943 | 
| .. | ... | ... | ... | ... | ... | ... | 
| 3 | 2023-01-01 00:26:00 | 2023-01-01 00:56:00 | 30 | 0.812442 | 0.370669 | -0.441773 | 
| 3 | 2023-01-01 00:27:00 | 2023-01-01 00:57:00 | 30 | 0.352933 | 0.565692 | 0.212759 | 
| 3 | 2023-01-01 00:28:00 | 2023-01-01 00:58:00 | 30 | 0.186111 | 0.121998 | -0.064113 | 
| 3 | 2023-01-01 00:29:00 | 2023-01-01 00:59:00 | 30 | 0.671967 | 0.872352 | 0.200385 | 
| 3 | 2023-01-01 00:30:00 | 2023-01-02 01:00:00 | 30 | 0.008082 | 0.624171 | 0.616090 | 
If we want to keep the rows with no start times as well, just change the `display_na_rows` argument to True:

```python
get_lagged_df(df, display_na_rows=True)
```
| id | time_start | time_end | time_difference | val_start | val_end | val_difference | 
|---|---|---|---|---|---|---|
| 1 | NaT | 2023-01-01 00:00:00 | NaN | NaN | 0.638129 | NaN | 
| 1 | NaT | 2023-01-01 00:01:00 | NaN | NaN | 0.047401 | NaN | 
| 1 | NaT | 2023-01-01 00:02:00 | NaN | NaN | 0.781685 | NaN | 
| 1 | NaT | 2023-01-01 00:03:00 | NaN | NaN | 0.839184 | NaN | 
| 1 | NaT | 2023-01-01 00:04:00 | NaN | NaN | 0.526894 | NaN | 
| .. | ... | ... | ... | ... | ... | ... | 
| 3 | 2023-01-01 00:26:00 | 2023-01-01 00:56:00 | 30.0 | 0.812442 | 0.370669 | -0.441773 | 
| 3 | 2023-01-01 00:27:00 | 2023-01-01 00:57:00 | 30.0 | 0.352933 | 0.565692 | 0.212759 | 
| 3 | 2023-01-01 00:28:00 | 2023-01-01 00:58:00 | 30.0 | 0.186111 | 0.121998 | -0.064113 | 
| 3 | 2023-01-01 00:29:00 | 2023-01-01 00:59:00 | 30.0 | 0.671967 | 0.872352 | 0.200385 | 
| 3 | 2023-01-01 00:30:00 | 2023-01-02 01:00:00 | 30.0 | 0.008082 | 0.624171 | 0.616090 | 
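One detail worth noting about the self-join above: `Timedelta.seconds` returns only the sub-day seconds component, which is why the last row pairs 2023-01-01 00:30:00 with 2023-01-02 01:00:00 (one day and 30 minutes apart) yet still reports a 30-minute difference. If windows should never span days, `total_seconds()` is the safer choice, as this small demo shows:

```python
import pandas as pd

td = pd.Timestamp('2023-01-02 01:00:00') - pd.Timestamp('2023-01-01 00:30:00')

print(td.seconds / 60)          # 30.0   -- the day component is ignored
print(td.total_seconds() / 60)  # 1470.0 -- the full difference in minutes
```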