根据条件和按照ID分组,在pandas中计算与下一行的时间差。

huangapple go评论69阅读模式
英文:

Calculate the time difference with next row based on a condition and grouping by ID in pandas

问题

df_input['timestamp'] = pd.to_datetime(df_input['timestamp'])
df_input['timestamp_next'] = df_input.groupby('ID')['timestamp'].shift(-1)
df_input['time_diff_next'] = (df_input['timestamp_next'] - df_input['timestamp']).dt.seconds
df_input = df_input[df_input['Condition'] > 0]
result = df_input[['timestamp_next', 'time_diff_next']]
英文:

The time difference with next row, grouping by ID, having condition>0 need to be calculated in pandas

input

ID          timestamp  Condition
aa  2023-1-5 06:33:27      23.33
aa  2023-1-5 06:33:33      13.26
aa  2023-1-5 06:33:39       5.71
aa  2023-1-5 06:33:45       0.00
aa  2023-1-5 06:33:51       0.00
aa  2023-1-5 06:33:57       0.00
aa  2023-1-5 06:46:15       0.00
aa  2023-1-5 06:46:21       0.00
aa  2023-1-5 06:46:27       2.18
aa  2023-1-5 06:46:33       0.00
aa  2023-1-5 06:46:39       4.10
aa  2023-1-5 06:46:45      21.73
aa  2023-1-5 06:46:51      33.79

Output:

timestamp_next | time_diff_next(seconds) |

1/5/2023 6:33:33 | 6 |

1/5/2023 6:33:39 | 6 |

1/5/2023 6:46:27 | 768 |

| |

| |

| |

||

||

1/5/2023 6:46:39 | 12 |

||

1/5/2023 6:46:45 | 6 |

1/5/2023 6:46:51 |6 |

1/5/2023 6:46:57 | 6 |

Sample code


df2=df_input[(df_input['Condition']>0)]
       
df2['timestamp']= pd.to_datetime(df2['timestamp'])
df2['timestamp_next']=df2.groupby("id")["timestamp"].shift(-1)
df2['time_diff_next']=(df2['timestamp_next']-df2['timestamp'])/timedelta(seconds=1)

df_input=df_input.merge(df2[['id','timestamp','timestamp_next','time_diff_next']],how='left',on=['id','timestamp'])


I need to implement this code without creating new dataframe df2 as in above code

答案1

得分: 2

df['time_diff_next'] = (df.mask(df['Condition'].eq(0)).groupby('ID')['timestamp']
                          .transform(lambda x: x.diff().dt.total_seconds().shift(-1)))
print(df)

输出

ID           timestamp  Condition  time_diff_next

0 aa 2023-01-05 06:33:27 23.33 6.0
1 aa 2023-01-05 06:33:33 13.26 6.0
2 aa 2023-01-05 06:33:39 5.71 768.0
3 aa 2023-01-05 06:33:45 0.00 NaN
4 aa 2023-01-05 06:33:51 0.00 NaN
5 aa 2023-01-05 06:33:57 0.00 NaN
6 aa 2023-01-05 06:46:15 0.00 NaN
7 aa 2023-01-05 06:46:21 0.00 NaN
8 aa 2023-01-05 06:46:27 2.18 12.0
9 aa 2023-01-05 06:46:33 0.00 NaN
10 aa 2023-01-05 06:46:39 4.10 6.0
11 aa 2023-01-05 06:46:45 21.73 6.0
12 aa 2023-01-05 06:46:51 33.79 NaN


<details>
<summary>英文:</summary>

You can use:

df['time_diff_next'] = (df.mask(df['Condition'].eq(0)).groupby('ID')['timestamp']
.transform(lambda x: x.diff().dt.total_seconds().shift(-1)))
print(df)

Output

ID           timestamp  Condition  time_diff_next

0 aa 2023-01-05 06:33:27 23.33 6.0
1 aa 2023-01-05 06:33:33 13.26 6.0
2 aa 2023-01-05 06:33:39 5.71 768.0
3 aa 2023-01-05 06:33:45 0.00 NaN
4 aa 2023-01-05 06:33:51 0.00 NaN
5 aa 2023-01-05 06:33:57 0.00 NaN
6 aa 2023-01-05 06:46:15 0.00 NaN
7 aa 2023-01-05 06:46:21 0.00 NaN
8 aa 2023-01-05 06:46:27 2.18 12.0
9 aa 2023-01-05 06:46:33 0.00 NaN
10 aa 2023-01-05 06:46:39 4.10 6.0
11 aa 2023-01-05 06:46:45 21.73 6.0
12 aa 2023-01-05 06:46:51 33.79 NaN


</details>



# 答案2
**得分**: 2

以下是翻译好的部分:

如果您需要按组计算`diff`或`shift`,您需要使用[`groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)来避免副作用。 在非空条件下,对过滤后的DataFrame使用[`groupby.diff`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.diff.html)似乎是合适的。

以下是在原始DataFrame中工作的建议:

```python
m = df['Condition'].gt(0)
df['time_diff_next(seconds)'] = (df[m].groupby('ID')['timestamp']
                                 .diff().dt.total_seconds()
                                )

避免使用groupby的另一种方法可能是使用pivotmerge,但我预计性能会相当差:

df.merge(df[m].pivot(index='timestamp', columns='ID', values='timestamp'
                    ).diff().unstack().rename('time_diff_next(seconds)'),
         left_on=['ID', 'timestamp'], right_index=True, how='left')

输出:

    ID           timestamp  Condition  time_diff_next(seconds)
0   aa 2023-01-05 06:33:27      23.33                      NaN
1   aa 2023-01-05 06:33:33      13.26                      6.0
2   aa 2023-01-05 06:33:39       5.71                      6.0
3   aa 2023-01-05 06:33:45       0.00                      NaN
4   aa 2023-01-05 06:33:51       0.00                      NaN
5   aa 2023-01-05 06:33:57       0.00                      NaN
6   aa 2023-01-05 06:46:15       0.00                      NaN
7   aa 2023-01-05 06:46:21       0.00                      NaN
8   aa 2023-01-05 06:46:27       2.18                    768.0
9   aa 2023-01-05 06:46:33       0.00                      NaN
10  aa 2023-01-05 06:46:39       4.10                     12.0
11  aa 2023-01-05 06:46:45      21.73                      6.0
12  aa 2023-01-05 06:46:51      33.79                      6.0

注意:如果您想要得到diff的移动版本,请使用diff(-1)并对输出取反:

m = df['Condition'].gt(0)
df['time_diff_next(seconds)'] = (-df[m].groupby('ID')['timestamp']
                                 .diff(-1).dt.total_seconds()
                                 )

输出:

    ID           timestamp  Condition  time_diff_next(seconds)
0   aa 2023-01-05 06:33:27      23.33                      6.0
1   aa 2023-01-05 06:33:33      13.26                      6.0
2   aa 2023-01-05 06:33:39       5.71                    768.0
3   aa 2023-01-05 06:33:45       0.00                      NaN
4   aa 2023-01-05 06:33:51       0.00                      NaN
5   aa 2023-01-05 06:33:57       0.00                      NaN
6   aa 2023-01-05 06:46:15       0.00                      NaN
7   aa 2023-01-05 06:46:21       0.00                      NaN
8   aa 2023-01-05 06:46:27       2.18                     12.0
9   aa 2023-01-05 06:46:33       0.00                      NaN
10  aa 2023-01-05 06:46:39       4.10                      6.0
11  aa 2023-01-05 06:46:45      21.73                      6.0
12  aa 2023-01-05 06:46:51      33.79                      NaN
英文:

If you have to compute diff or shift per group, you need to use a groupby to avoid side effects. groupby.diff on the filtered DataFrame for non-null Condition seems appropriate here.

Here is one suggestion to work in the original DataFrame:

m = df[&#39;Condition&#39;].gt(0)
df[&#39;time_diff_next(seconds)&#39;] = (df[m].groupby(&#39;ID&#39;)[&#39;timestamp&#39;]
                                 .diff().dt.total_seconds()
                                )

Another approach to avoid a groupby might be to pivot and merge but I expect performance to be quite bad:

df.merge(df[m].pivot(index=&#39;timestamp&#39;, columns=&#39;ID&#39;, values=&#39;timestamp&#39;
                    ).diff().unstack().rename(&#39;time_diff_next(seconds)&#39;),
         left_on=[&#39;ID&#39;, &#39;timestamp&#39;], right_index=True, how=&#39;left&#39;)     

Output:

    ID           timestamp  Condition  time_diff_next(seconds)
0   aa 2023-01-05 06:33:27      23.33                      NaN
1   aa 2023-01-05 06:33:33      13.26                      6.0
2   aa 2023-01-05 06:33:39       5.71                      6.0
3   aa 2023-01-05 06:33:45       0.00                      NaN
4   aa 2023-01-05 06:33:51       0.00                      NaN
5   aa 2023-01-05 06:33:57       0.00                      NaN
6   aa 2023-01-05 06:46:15       0.00                      NaN
7   aa 2023-01-05 06:46:21       0.00                      NaN
8   aa 2023-01-05 06:46:27       2.18                    768.0
9   aa 2023-01-05 06:46:33       0.00                      NaN
10  aa 2023-01-05 06:46:39       4.10                     12.0
11  aa 2023-01-05 06:46:45      21.73                      6.0
12  aa 2023-01-05 06:46:51      33.79                      6.0

NB. if you want to get the diff shifter up, use diff(-1) and negate the output:

m = df[&#39;Condition&#39;].gt(0)
df[&#39;time_diff_next(seconds)&#39;] = (-df[m].groupby(&#39;ID&#39;)[&#39;timestamp&#39;]
                                 .diff(-1).dt.total_seconds()
                                 )

Output:

    ID           timestamp  Condition  time_diff_next(seconds)
0   aa 2023-01-05 06:33:27      23.33                      6.0
1   aa 2023-01-05 06:33:33      13.26                      6.0
2   aa 2023-01-05 06:33:39       5.71                    768.0
3   aa 2023-01-05 06:33:45       0.00                      NaN
4   aa 2023-01-05 06:33:51       0.00                      NaN
5   aa 2023-01-05 06:33:57       0.00                      NaN
6   aa 2023-01-05 06:46:15       0.00                      NaN
7   aa 2023-01-05 06:46:21       0.00                      NaN
8   aa 2023-01-05 06:46:27       2.18                     12.0
9   aa 2023-01-05 06:46:33       0.00                      NaN
10  aa 2023-01-05 06:46:39       4.10                      6.0
11  aa 2023-01-05 06:46:45      21.73                      6.0
12  aa 2023-01-05 06:46:51      33.79                      NaN

huangapple
  • 本文由 发表于 2023年2月8日 18:56:08
  • 转载请务必保留本文链接:https://go.coder-hub.com/75384758.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定