February 8, 2023 18:56:08

Calculate the time difference with next row based on a condition and grouping by ID in pandas

Question

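import pandas as pd

df_input['timestamp'] = pd.to_datetime(df_input['timestamp'])
# Filter first, so that shift(-1) finds the next row with Condition > 0 within each ID
df_input = df_input[df_input['Condition'] > 0].copy()
df_input['timestamp_next'] = df_input.groupby('ID')['timestamp'].shift(-1)
# total_seconds() gives the full gap; .dt.seconds would drop whole days
df_input['time_diff_next'] = (df_input['timestamp_next'] - df_input['timestamp']).dt.total_seconds()
result = df_input[['timestamp_next', 'time_diff_next']]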

I need to calculate, in pandas, the time difference between each row and the next row within the same ID, considering only rows where Condition > 0.

Input:

ID          timestamp  Condition
aa  2023-1-5 06:33:27      23.33
aa  2023-1-5 06:33:33      13.26
aa  2023-1-5 06:33:39       5.71
aa  2023-1-5 06:33:45       0.00
aa  2023-1-5 06:33:51       0.00
aa  2023-1-5 06:33:57       0.00
aa  2023-1-5 06:46:15       0.00
aa  2023-1-5 06:46:21       0.00
aa  2023-1-5 06:46:27       2.18
aa  2023-1-5 06:46:33       0.00
aa  2023-1-5 06:46:39       4.10
aa  2023-1-5 06:46:45      21.73
aa  2023-1-5 06:46:51      33.79
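
To reproduce the problem, the input table can be built as a DataFrame along these lines. This is only a setup sketch: the name `df_input` follows the question's sample code, and the timestamps are written in zero-padded ISO form (an assumption made here so that parsing is unambiguous).

```python
import pandas as pd

# Values copied from the input table; timestamps zero-padded for unambiguous parsing
df_input = pd.DataFrame({
    "ID": ["aa"] * 13,
    "timestamp": [
        "2023-01-05 06:33:27", "2023-01-05 06:33:33", "2023-01-05 06:33:39",
        "2023-01-05 06:33:45", "2023-01-05 06:33:51", "2023-01-05 06:33:57",
        "2023-01-05 06:46:15", "2023-01-05 06:46:21", "2023-01-05 06:46:27",
        "2023-01-05 06:46:33", "2023-01-05 06:46:39", "2023-01-05 06:46:45",
        "2023-01-05 06:46:51",
    ],
    "Condition": [23.33, 13.26, 5.71, 0.00, 0.00, 0.00, 0.00,
                  0.00, 2.18, 0.00, 4.10, 21.73, 33.79],
})

# Convert the strings to datetime64 so that differences can be computed
df_input["timestamp"] = pd.to_datetime(df_input["timestamp"])
print(df_input.dtypes)
```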

Output:

timestamp_next   | time_diff_next(seconds) |
1/5/2023 6:33:33 | 6                       |
1/5/2023 6:33:39 | 6                       |
1/5/2023 6:46:27 | 768                     |
                 |                         |
                 |                         |
                 |                         |
                 |                         |
                 |                         |
1/5/2023 6:46:39 | 12                      |
                 |                         |
1/5/2023 6:46:45 | 6                       |
1/5/2023 6:46:51 | 6                       |
1/5/2023 6:46:57 | 6                       |

Sample code


import pandas as pd

# Parse timestamps on the full frame first so the merge keys have matching dtypes
df_input['timestamp'] = pd.to_datetime(df_input['timestamp'])
# Copy the rows with Condition > 0 (avoids SettingWithCopyWarning on assignment)
df2 = df_input[df_input['Condition'] > 0].copy()
df2['timestamp_next'] = df2.groupby('ID')['timestamp'].shift(-1)
df2['time_diff_next'] = (df2['timestamp_next'] - df2['timestamp']) / pd.Timedelta(seconds=1)

df_input = df_input.merge(df2[['ID', 'timestamp', 'timestamp_next', 'time_diff_next']], how='left', on=['ID', 'timestamp'])

I need to achieve the same result without creating the new DataFrame df2 used in the code above.
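
A side note on units: dividing a timedelta column by a one-second timedelta, as the sample code does, gives the same numbers as `.dt.total_seconds()`, whereas `.dt.seconds` returns only the seconds component and ignores whole days. The sketch below illustrates this with two hypothetical gaps (768 seconds, and one day plus 6 seconds) that are invented for the example.

```python
import pandas as pd

# Hypothetical gaps chosen for illustration: 768 seconds, and 1 day + 6 seconds
gaps = pd.Series(pd.to_timedelta(["0 days 00:12:48", "1 days 00:00:06"]))

# Seconds component only: the whole day in the second gap is not counted
component = gaps.dt.seconds.tolist()

# Full duration in seconds
total = gaps.dt.total_seconds().tolist()

# Division by a one-second timedelta is equivalent to total_seconds()
divided = (gaps / pd.Timedelta(seconds=1)).tolist()

print(component)  # [768, 6]
print(total)      # [768.0, 86406.0]
print(divided)    # [768.0, 86406.0]
```

For gaps shorter than a day, as in the sample data, all three agree; the difference only shows up once consecutive qualifying rows are a day or more apart.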

Answer 1

Score: 2

You can use:

df['time_diff_next'] = (df.mask(df['Condition'].eq(0)).groupby('ID')['timestamp']
                          .transform(lambda x: x.diff().dt.total_seconds().shift(-1)))
print(df)

Output

    ID           timestamp  Condition  time_diff_next
0   aa 2023-01-05 06:33:27      23.33             6.0
1   aa 2023-01-05 06:33:33      13.26             6.0
2   aa 2023-01-05 06:33:39       5.71           768.0
3   aa 2023-01-05 06:33:45       0.00             NaN
4   aa 2023-01-05 06:33:51       0.00             NaN
5   aa 2023-01-05 06:33:57       0.00             NaN
6   aa 2023-01-05 06:46:15       0.00             NaN
7   aa 2023-01-05 06:46:21       0.00             NaN
8   aa 2023-01-05 06:46:27       2.18            12.0
9   aa 2023-01-05 06:46:33       0.00             NaN
10  aa 2023-01-05 06:46:39       4.10             6.0
11  aa 2023-01-05 06:46:45      21.73             6.0
12  aa 2023-01-05 06:46:51      33.79             NaN
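
The question's expected output also contains a `timestamp_next` column. A possible variation on the masking idea above, offered as a sketch rather than as the answerer's own code, is to shift the masked timestamps per group and then subtract. It relies on `mask` turning every column of a `Condition == 0` row into a missing value, including `ID`, so that `groupby` leaves those rows out of the groups.

```python
import pandas as pd

# Sample data from the question, rebuilt here so the snippet runs on its own
times = ["06:33:27", "06:33:33", "06:33:39", "06:33:45", "06:33:51",
         "06:33:57", "06:46:15", "06:46:21", "06:46:27", "06:46:33",
         "06:46:39", "06:46:45", "06:46:51"]
df = pd.DataFrame({
    "ID": ["aa"] * len(times),
    "timestamp": pd.to_datetime(["2023-01-05 " + t for t in times]),
    "Condition": [23.33, 13.26, 5.71, 0, 0, 0, 0, 0, 2.18, 0, 4.10, 21.73, 33.79],
})

# Rows with Condition == 0 become all-missing (including ID), so groupby skips them
masked = df.mask(df["Condition"].eq(0))

# Next qualifying timestamp within each ID; masked rows get NaT
df["timestamp_next"] = masked.groupby("ID")["timestamp"].shift(-1)

# Gap to that next timestamp in seconds; NaN where there is no next qualifying row
df["time_diff_next"] = (df["timestamp_next"] - df["timestamp"]).dt.total_seconds()

print(df)
```

No intermediate filtered DataFrame is merged back; both new columns are assigned directly on `df`.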





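# Answer 2
**Score**: 2

If you need to compute `diff` or `shift` per group, use a [`groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) to avoid side effects, that is, values from one ID leaking into the next. Applying [`groupby.diff`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.diff.html) to the DataFrame filtered to rows with `Condition > 0` seems appropriate here.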
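
The side effect mentioned above can be seen on a small frame with two IDs. The data here is hypothetical and invented for illustration: a plain `diff` compares the first row of `bb` with the last row of `aa`, while the grouped `diff` restarts at each ID.

```python
import pandas as pd

# Hypothetical two-ID frame, invented for illustration
df = pd.DataFrame({
    "ID": ["aa", "aa", "bb", "bb"],
    "timestamp": pd.to_datetime([
        "2023-01-05 06:00:00", "2023-01-05 06:00:06",
        "2023-01-05 07:00:00", "2023-01-05 07:00:06",
    ]),
})

# Plain diff ignores ID: the first "bb" row is compared with the last "aa" row
df["plain_diff"] = df["timestamp"].diff().dt.total_seconds()

# Grouped diff restarts at each ID, so the first row of every group is NaN
df["grouped_diff"] = df.groupby("ID")["timestamp"].diff().dt.total_seconds()

print(df)
```

Row 2 gets 3594.0 in `plain_diff` (the gap between the two IDs) but NaN in `grouped_diff`.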

Here is one suggestion that works in the original DataFrame:

```python
m = df['Condition'].gt(0)
df['time_diff_next(seconds)'] = (df[m].groupby('ID')['timestamp']
                                 .diff().dt.total_seconds()
                                )
```

Another approach that avoids a groupby might be to pivot and merge, but I expect performance to be quite bad:

df.merge(df[m].pivot(index='timestamp', columns='ID', values='timestamp'
                    ).diff().unstack().rename('time_diff_next(seconds)'),
         left_on=['ID', 'timestamp'], right_index=True, how='left')

Output:

    ID           timestamp  Condition  time_diff_next(seconds)
0   aa 2023-01-05 06:33:27      23.33                      NaN
1   aa 2023-01-05 06:33:33      13.26                      6.0
2   aa 2023-01-05 06:33:39       5.71                      6.0
3   aa 2023-01-05 06:33:45       0.00                      NaN
4   aa 2023-01-05 06:33:51       0.00                      NaN
5   aa 2023-01-05 06:33:57       0.00                      NaN
6   aa 2023-01-05 06:46:15       0.00                      NaN
7   aa 2023-01-05 06:46:21       0.00                      NaN
8   aa 2023-01-05 06:46:27       2.18                    768.0
9   aa 2023-01-05 06:46:33       0.00                      NaN
10  aa 2023-01-05 06:46:39       4.10                     12.0
11  aa 2023-01-05 06:46:45      21.73                      6.0
12  aa 2023-01-05 06:46:51      33.79                      6.0

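NB. If you want the diff shifted up, so that each row holds the difference to the next qualifying row, use diff(-1) and negate the output: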

m = df['Condition'].gt(0)
df['time_diff_next(seconds)'] = (-df[m].groupby('ID')['timestamp']
                                 .diff(-1).dt.total_seconds()
                                 )

Output:

    ID           timestamp  Condition  time_diff_next(seconds)
0   aa 2023-01-05 06:33:27      23.33                      6.0
1   aa 2023-01-05 06:33:33      13.26                      6.0
2   aa 2023-01-05 06:33:39       5.71                    768.0
3   aa 2023-01-05 06:33:45       0.00                      NaN
4   aa 2023-01-05 06:33:51       0.00                      NaN
5   aa 2023-01-05 06:33:57       0.00                      NaN
6   aa 2023-01-05 06:46:15       0.00                      NaN
7   aa 2023-01-05 06:46:21       0.00                      NaN
8   aa 2023-01-05 06:46:27       2.18                     12.0
9   aa 2023-01-05 06:46:33       0.00                      NaN
10  aa 2023-01-05 06:46:39       4.10                      6.0
11  aa 2023-01-05 06:46:45      21.73                      6.0
12  aa 2023-01-05 06:46:51      33.79                      NaN

