英文:
Calculate the time difference with next row based on a condition and grouping by ID in pandas
问题
df_input['timestamp'] = pd.to_datetime(df_input['timestamp'])
df_input['timestamp_next'] = df_input.groupby('ID')['timestamp'].shift(-1)
df_input['time_diff_next'] = (df_input['timestamp_next'] - df_input['timestamp']).dt.seconds
df_input = df_input[df_input['Condition'] > 0]
result = df_input[['timestamp_next', 'time_diff_next']]
英文:
The time difference with next row, grouping by ID, having condition>0 need to be calculated in pandas
input
ID timestamp Condition
aa 2023-1-5 06:33:27 23.33
aa 2023-1-5 06:33:33 13.26
aa 2023-1-5 06:33:39 5.71
aa 2023-1-5 06:33:45 0.00
aa 2023-1-5 06:33:51 0.00
aa 2023-1-5 06:33:57 0.00
aa 2023-1-5 06:46:15 0.00
aa 2023-1-5 06:46:21 0.00
aa 2023-1-5 06:46:27 2.18
aa 2023-1-5 06:46:33 0.00
aa 2023-1-5 06:46:39 4.10
aa 2023-1-5 06:46:45 21.73
aa 2023-1-5 06:46:51 33.79
Output:
timestamp_next | time_diff_next(seconds) |
1/5/2023 6:33:33 | 6 |
1/5/2023 6:33:39 | 6 |
1/5/2023 6:46:27 | 768 |
| |
| |
| |
||
||
1/5/2023 6:46:39 | 12 |
||
1/5/2023 6:46:45 | 6 |
1/5/2023 6:46:51 |6 |
1/5/2023 6:46:57 | 6 |
Sample code
df2=df_input[(df_input['Condition']>0)]
df2['timestamp']= pd.to_datetime(df2['timestamp'])
df2['timestamp_next']=df2.groupby("id")["timestamp"].shift(-1)
df2['time_diff_next']=(df2['timestamp_next']-df2['timestamp'])/timedelta(seconds=1)
df_input=df_input.merge(df2[['id','timestamp','timestamp_next','time_diff_next']],how='left',on=['id','timestamp'])
I need to implement this code without creating new dataframe df2 as in above code
答案1
得分: 2
df['time_diff_next'] = (df.mask(df['Condition'].eq(0)).groupby('ID')['timestamp']
.transform(lambda x: x.diff().dt.total_seconds().shift(-1)))
print(df)
输出
ID timestamp Condition time_diff_next
0 aa 2023-01-05 06:33:27 23.33 6.0
1 aa 2023-01-05 06:33:33 13.26 6.0
2 aa 2023-01-05 06:33:39 5.71 768.0
3 aa 2023-01-05 06:33:45 0.00 NaN
4 aa 2023-01-05 06:33:51 0.00 NaN
5 aa 2023-01-05 06:33:57 0.00 NaN
6 aa 2023-01-05 06:46:15 0.00 NaN
7 aa 2023-01-05 06:46:21 0.00 NaN
8 aa 2023-01-05 06:46:27 2.18 12.0
9 aa 2023-01-05 06:46:33 0.00 NaN
10 aa 2023-01-05 06:46:39 4.10 6.0
11 aa 2023-01-05 06:46:45 21.73 6.0
12 aa 2023-01-05 06:46:51 33.79 NaN
<details>
<summary>英文:</summary>
You can use:
df['time_diff_next'] = (df.mask(df['Condition'].eq(0)).groupby('ID')['timestamp']
.transform(lambda x: x.diff().dt.total_seconds().shift(-1)))
print(df)
Output
ID timestamp Condition time_diff_next
0 aa 2023-01-05 06:33:27 23.33 6.0
1 aa 2023-01-05 06:33:33 13.26 6.0
2 aa 2023-01-05 06:33:39 5.71 768.0
3 aa 2023-01-05 06:33:45 0.00 NaN
4 aa 2023-01-05 06:33:51 0.00 NaN
5 aa 2023-01-05 06:33:57 0.00 NaN
6 aa 2023-01-05 06:46:15 0.00 NaN
7 aa 2023-01-05 06:46:21 0.00 NaN
8 aa 2023-01-05 06:46:27 2.18 12.0
9 aa 2023-01-05 06:46:33 0.00 NaN
10 aa 2023-01-05 06:46:39 4.10 6.0
11 aa 2023-01-05 06:46:45 21.73 6.0
12 aa 2023-01-05 06:46:51 33.79 NaN
</details>
# 答案2
**得分**: 2
以下是翻译好的部分:
如果您需要按组计算`diff`或`shift`,您需要使用[`groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)来避免副作用。 在非空条件下,对过滤后的DataFrame使用[`groupby.diff`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.diff.html)似乎是合适的。
以下是在原始DataFrame中工作的建议:
```python
m = df['Condition'].gt(0)
df['time_diff_next(seconds)'] = (df[m].groupby('ID')['timestamp']
.diff().dt.total_seconds()
)
避免使用groupby
的另一种方法可能是使用pivot
和merge
,但我预计性能会相当差:
df.merge(df[m].pivot(index='timestamp', columns='ID', values='timestamp'
).diff().unstack().rename('time_diff_next(seconds)'),
left_on=['ID', 'timestamp'], right_index=True, how='left')
输出:
ID timestamp Condition time_diff_next(seconds)
0 aa 2023-01-05 06:33:27 23.33 NaN
1 aa 2023-01-05 06:33:33 13.26 6.0
2 aa 2023-01-05 06:33:39 5.71 6.0
3 aa 2023-01-05 06:33:45 0.00 NaN
4 aa 2023-01-05 06:33:51 0.00 NaN
5 aa 2023-01-05 06:33:57 0.00 NaN
6 aa 2023-01-05 06:46:15 0.00 NaN
7 aa 2023-01-05 06:46:21 0.00 NaN
8 aa 2023-01-05 06:46:27 2.18 768.0
9 aa 2023-01-05 06:46:33 0.00 NaN
10 aa 2023-01-05 06:46:39 4.10 12.0
11 aa 2023-01-05 06:46:45 21.73 6.0
12 aa 2023-01-05 06:46:51 33.79 6.0
注意:如果您想要得到diff
的移动版本,请使用diff(-1)
并对输出取反:
m = df['Condition'].gt(0)
df['time_diff_next(seconds)'] = (-df[m].groupby('ID')['timestamp']
.diff(-1).dt.total_seconds()
)
输出:
ID timestamp Condition time_diff_next(seconds)
0 aa 2023-01-05 06:33:27 23.33 6.0
1 aa 2023-01-05 06:33:33 13.26 6.0
2 aa 2023-01-05 06:33:39 5.71 768.0
3 aa 2023-01-05 06:33:45 0.00 NaN
4 aa 2023-01-05 06:33:51 0.00 NaN
5 aa 2023-01-05 06:33:57 0.00 NaN
6 aa 2023-01-05 06:46:15 0.00 NaN
7 aa 2023-01-05 06:46:21 0.00 NaN
8 aa 2023-01-05 06:46:27 2.18 12.0
9 aa 2023-01-05 06:46:33 0.00 NaN
10 aa 2023-01-05 06:46:39 4.10 6.0
11 aa 2023-01-05 06:46:45 21.73 6.0
12 aa 2023-01-05 06:46:51 33.79 NaN
英文:
If you have to compute diff
or shift
per group, you need to use a groupby
to avoid side effects. groupby.diff
on the filtered DataFrame for non-null Condition seems appropriate here.
Here is one suggestion to work in the original DataFrame:
m = df['Condition'].gt(0)
df['time_diff_next(seconds)'] = (df[m].groupby('ID')['timestamp']
.diff().dt.total_seconds()
)
Another approach to avoid a groupby
might be to pivot
and merge
but I expect performance to be quite bad:
df.merge(df[m].pivot(index='timestamp', columns='ID', values='timestamp'
).diff().unstack().rename('time_diff_next(seconds)'),
left_on=['ID', 'timestamp'], right_index=True, how='left')
Output:
ID timestamp Condition time_diff_next(seconds)
0 aa 2023-01-05 06:33:27 23.33 NaN
1 aa 2023-01-05 06:33:33 13.26 6.0
2 aa 2023-01-05 06:33:39 5.71 6.0
3 aa 2023-01-05 06:33:45 0.00 NaN
4 aa 2023-01-05 06:33:51 0.00 NaN
5 aa 2023-01-05 06:33:57 0.00 NaN
6 aa 2023-01-05 06:46:15 0.00 NaN
7 aa 2023-01-05 06:46:21 0.00 NaN
8 aa 2023-01-05 06:46:27 2.18 768.0
9 aa 2023-01-05 06:46:33 0.00 NaN
10 aa 2023-01-05 06:46:39 4.10 12.0
11 aa 2023-01-05 06:46:45 21.73 6.0
12 aa 2023-01-05 06:46:51 33.79 6.0
NB. if you want to get the diff shifter up, use diff(-1)
and negate the output:
m = df['Condition'].gt(0)
df['time_diff_next(seconds)'] = (-df[m].groupby('ID')['timestamp']
.diff(-1).dt.total_seconds()
)
Output:
ID timestamp Condition time_diff_next(seconds)
0 aa 2023-01-05 06:33:27 23.33 6.0
1 aa 2023-01-05 06:33:33 13.26 6.0
2 aa 2023-01-05 06:33:39 5.71 768.0
3 aa 2023-01-05 06:33:45 0.00 NaN
4 aa 2023-01-05 06:33:51 0.00 NaN
5 aa 2023-01-05 06:33:57 0.00 NaN
6 aa 2023-01-05 06:46:15 0.00 NaN
7 aa 2023-01-05 06:46:21 0.00 NaN
8 aa 2023-01-05 06:46:27 2.18 12.0
9 aa 2023-01-05 06:46:33 0.00 NaN
10 aa 2023-01-05 06:46:39 4.10 6.0
11 aa 2023-01-05 06:46:45 21.73 6.0
12 aa 2023-01-05 06:46:51 33.79 NaN
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论