April 11, 2023, 01:14:41


How to generate pandas dataframe timedelta column grouped by id and date (YYYY-MM-DD)?

Question


Suppose I have a dataframe with id and datetime columns:

import pandas as pd

df = pd.DataFrame({"id": ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2", "a3", "a3", "a3", "a3"],
                   "datetime": ["2016-01-01 00:01:00.156",
                                "2016-01-01 12:00:00.425",
                                "2016-01-02 00:59:00.123",
                                "2016-01-02 14:16:00.548",
                                "2016-01-01 12:00:00.147",
                                "2016-01-01 13:59:00.123",
                                "2016-01-02 08:01:00.147",
                                "2016-01-02 18:49:00.123",
                                "2016-02-01 12:00:00.147",
                                "2016-02-01 13:59:00.123",
                                "2016-02-02 08:01:00.147",
                                "2016-02-02 18:49:00.123"]})
df["datetime"] = pd.to_datetime(df["datetime"])
df

Here is the dataframe:

    id	datetime
0	a1	2016-01-01 00:01:00.156
1	a1	2016-01-01 12:00:00.425
2	a1	2016-01-02 00:59:00.123
3	a1	2016-01-02 14:16:00.548
4	a2	2016-01-01 12:00:00.147
5	a2	2016-01-01 13:59:00.123
6	a2	2016-01-02 08:01:00.147
7	a2	2016-01-02 18:49:00.123
8	a3	2016-02-01 12:00:00.147
9	a3	2016-02-01 13:59:00.123
10	a3	2016-02-02 08:01:00.147
11	a3	2016-02-02 18:49:00.123

I want to generate a timedelta column that holds the minutes elapsed since a per-group baseline datetime. This is the output I expect to get:

    id	datetime	            datetime_baseline	    timedelta
0	a1	2016-01-01 00:01:00.156	2016-01-01 00:01:00.156	0
1	a1	2016-01-01 12:00:00.425	2016-01-01 00:01:00.156	719
2	a1	2016-01-02 00:59:00.123	2016-01-02 00:59:00.123	0
3	a1	2016-01-02 14:16:00.548	2016-01-02 00:59:00.123	797
4	a2	2016-01-01 12:00:00.147	2016-01-01 12:00:00.147	0
5	a2	2016-01-01 13:59:00.123	2016-01-01 12:00:00.147	119
6	a2	2016-01-02 08:01:00.147	2016-01-02 08:01:00.147	0
7	a2	2016-01-02 18:49:00.123	2016-01-02 08:01:00.147	648
8	a3	2016-02-01 12:00:00.147	2016-02-01 12:00:00.147	0
9	a3	2016-02-01 13:59:00.123	2016-02-01 12:00:00.147	119
10	a3	2016-02-02 08:01:00.147	2016-02-02 08:01:00.147	0
11	a3	2016-02-02 18:49:00.123	2016-02-02 08:01:00.147	648

Here is how the timedelta values should be calculated: (1) identify the first datetime within each combination of id and date (YYYY-MM-DD), and (2) use it as the baseline (datetime_baseline) to compute the timedelta, in minutes, for every datetime with the same id and the same date.

For id='a1' and date='2016-01-01', datetime_baseline='2016-01-01 00:01:00.156'. At index=0, timedelta=0 because '2016-01-01 00:01:00.156' - datetime_baseline = 0. At index=1, timedelta=719 because '2016-01-01 12:00:00.425' - datetime_baseline = 719 minutes. At index=2, the id is unchanged but the date is now '2016-01-02', so a new baseline applies: '2016-01-02 00:59:00.123', and timedelta = '2016-01-02 00:59:00.123' - datetime_baseline = 0. At index=3, timedelta = '2016-01-02 14:16:00.548' - datetime_baseline = 797.
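To make the arithmetic concrete, here is a minimal sketch that reproduces the index=1 value by hand. The variable names are only illustrative, and rounding to the nearest minute is an assumption that happens to match the expected output shown above.

```python
import pandas as pd

# Values taken from the example rows for id='a1' on 2016-01-01.
baseline = pd.Timestamp("2016-01-01 00:01:00.156")
later = pd.Timestamp("2016-01-01 12:00:00.425")

# Subtracting two Timestamps yields a Timedelta.
delta = later - baseline

# Convert to minutes, then round to the nearest whole minute (assumed rounding rule).
minutes = round(delta.total_seconds() / 60)
print(minutes)
```

The raw difference is 11 hours, 59 minutes and 0.269 seconds, which is why the result lands on 719.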

Although I see how the timedelta values should be calculated (timedelta = datetime - datetime_baseline), I don't know how to identify the baseline values, i.e. how to generate the datetime_baseline column. Please let me know if you need any further explanation.

P.S. The actual dataframe has more than 500 thousand rows.

Answer 1

Score: 2


With GroupBy.transform to make the baseline:

df["datetime_baseline"] = (df.groupby(["id", df["datetime"].dt.date])
                             ["datetime"].transform("first"))

And dt.total_seconds to compute the timedelta:

df["timedelta"] = ((df["datetime"].sub(df["datetime_baseline"]))
                   .dt.total_seconds().div(60).round(0).astype(int))

Output:

print(df)

    id                datetime       datetime_baseline  timedelta
0   a1 2016-01-01 00:01:00.156 2016-01-01 00:01:00.156          0
1   a1 2016-01-01 12:00:00.425 2016-01-01 00:01:00.156        719
2   a1 2016-01-02 00:59:00.123 2016-01-02 00:59:00.123          0
3   a1 2016-01-02 14:16:00.548 2016-01-02 00:59:00.123        797
4   a2 2016-01-01 12:00:00.147 2016-01-01 12:00:00.147          0
5   a2 2016-01-01 13:59:00.123 2016-01-01 12:00:00.147        119
6   a2 2016-01-02 08:01:00.147 2016-01-02 08:01:00.147          0
7   a2 2016-01-02 18:49:00.123 2016-01-02 08:01:00.147        648
8   a3 2016-02-01 12:00:00.147 2016-02-01 12:00:00.147          0
9   a3 2016-02-01 13:59:00.123 2016-02-01 12:00:00.147        119
10  a3 2016-02-02 08:01:00.147 2016-02-02 08:01:00.147          0
11  a3 2016-02-02 18:49:00.123 2016-02-02 08:01:00.147        648
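One thing worth noting: transform("first") returns the first row of each group in the frame's existing row order, so it equals the earliest timestamp only when the rows are already chronological within each id. Below is a sketch of a variant that does not depend on row order. The helper name add_minutes_since_first and the "day" key name are hypothetical, not part of the answer. It groups on dt.normalize() instead of dt.date so the key stays a datetime64 column rather than Python date objects, which may help on a frame with several hundred thousand rows (not measured here).

```python
import pandas as pd


def add_minutes_since_first(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add baseline and minutes-since-baseline columns."""
    out = df.copy()
    # Midnight of each timestamp's calendar day, kept as datetime64.
    day = out["datetime"].dt.normalize().rename("day")
    # "min" picks the earliest timestamp per (id, day) regardless of row order.
    out["datetime_baseline"] = out.groupby(["id", day])["datetime"].transform("min")
    elapsed = out["datetime"] - out["datetime_baseline"]
    out["timedelta"] = elapsed.dt.total_seconds().div(60).round().astype(int)
    return out


# Small, deliberately unsorted sample built from rows of the question's data.
sample = pd.DataFrame({
    "id": ["a1", "a1", "a1", "a2"],
    "datetime": pd.to_datetime([
        "2016-01-01 12:00:00.425",
        "2016-01-01 00:01:00.156",
        "2016-01-02 14:16:00.548",
        "2016-01-01 13:59:00.123",
    ]),
})
result = add_minutes_since_first(sample)
print(result)
```

Here the first row gets 719 even though its baseline appears after it in the frame; with "first" it would have been treated as its own baseline.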

Answer 2

Score: 1


Try:

import numpy as np

df['datetime_baseline'] = df.groupby(['id', df['datetime'].dt.date])["datetime"].transform('min')
# .dt.seconds suffices here because differences within one date stay under 24 hours.
df['timedelta'] = np.round((df['datetime'] - df['datetime_baseline']).dt.seconds / 60)
print(df)

Prints:

    id                datetime       datetime_baseline  timedelta
0   a1 2016-01-01 00:01:00.156 2016-01-01 00:01:00.156        0.0
1   a1 2016-01-01 12:00:00.425 2016-01-01 00:01:00.156      719.0
2   a1 2016-01-02 00:59:00.123 2016-01-02 00:59:00.123        0.0
3   a1 2016-01-02 14:16:00.548 2016-01-02 00:59:00.123      797.0
4   a2 2016-01-01 12:00:00.147 2016-01-01 12:00:00.147        0.0
5   a2 2016-01-01 13:59:00.123 2016-01-01 12:00:00.147      119.0
6   a2 2016-01-02 08:01:00.147 2016-01-02 08:01:00.147        0.0
7   a2 2016-01-02 18:49:00.123 2016-01-02 08:01:00.147      648.0
8   a3 2016-02-01 12:00:00.147 2016-02-01 12:00:00.147        0.0
9   a3 2016-02-01 13:59:00.123 2016-02-01 12:00:00.147      119.0
10  a3 2016-02-02 08:01:00.147 2016-02-02 08:01:00.147        0.0
11  a3 2016-02-02 18:49:00.123 2016-02-02 08:01:00.147      648.0
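This answer reads .dt.seconds, while the other one uses .dt.total_seconds(). The two agree for this problem because every difference is taken within a single calendar day, but they diverge once a gap reaches 24 hours. A small sketch with made-up durations (not from the question's data) shows the difference:

```python
import pandas as pd

# Illustrative durations: five minutes, and one day plus five minutes.
gaps = pd.Series([pd.Timedelta(minutes=5), pd.Timedelta(days=1, minutes=5)])

# .dt.seconds is only the seconds component (0 to 86399); whole days are dropped.
component = gaps.dt.seconds
# .dt.total_seconds() is the full duration expressed in seconds.
total = gaps.dt.total_seconds()

print(component.tolist())  # [300, 300]
print(total.tolist())      # [300.0, 86700.0]
```

If the grouping were ever widened beyond a single date (for example per id only), .dt.total_seconds() would be the safer choice.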
