Posted: 2023-05-25 18:53:44

How to calculate the average value of the most recent events across groups in a pandas DataFrame?

Question

I have a pandas DataFrame of events (timestamp, value, company ID, etc.).

Example:

	timestamp	value	name	nusers
0	2023-06-01 10:46:11	-1	A	1000
1	2023-06-01 11:12:12	1	A	1000
2	2023-06-01 15:52:44	0	A	1000
3	2023-06-01 18:24:15	0	A	1000
4	2023-06-01 19:19:58	1	A	1000
0	2023-06-01 07:00:41	0	B	2000
1	2023-06-01 09:44:46	-1	B	2000
2	2023-06-01 15:06:21	1	B	2000
3	2023-06-01 15:32:35	0	B	2000
4	2023-06-01 21:55:05	-1	B	2000
0	2023-06-01 08:20:33	0	C	3000
1	2023-06-01 15:02:17	-1	C	3000
2	2023-06-01 17:09:25	1	C	3000
3	2023-06-01 21:51:31	0	C	3000
4	2023-06-01 22:12:48	0	C	3000

For each event, I need to calculate the average of the most recent values across companies at that moment in time. The most straightforward way is to loop through all rows, take each company's most recent event at or before the current timestamp, and average those values.

For the DataFrame above, the 'naive' code that works looks like this:

import pandas as pd

res = []
for index, row in df.iterrows():
    # All events up to and including the current one
    recent = df[df.timestamp <= row.timestamp]
    # Last row per company (relies on each company's rows being in chronological order)
    latest_values = recent.groupby('name').last()
    res.append(dict(timestamp=row.timestamp, value=latest_values.value.mean()))
aggregated_df = pd.DataFrame(res)
aggregated_df.sort_values('timestamp', inplace=True)
aggregated_df

This gives the result I need:

	timestamp	value
5	2023-06-01 07:00:41	0.000000
10	2023-06-01 08:20:33	0.000000
6	2023-06-01 09:44:46	-0.500000
0	2023-06-01 10:46:11	-0.666667
1	2023-06-01 11:12:12	0.000000
11	2023-06-01 15:02:17	-0.333333
7	2023-06-01 15:06:21	0.333333
8	2023-06-01 15:32:35	0.000000
2	2023-06-01 15:52:44	-0.333333
12	2023-06-01 17:09:25	0.333333
3	2023-06-01 18:24:15	0.333333
4	2023-06-01 19:19:58	0.666667
13	2023-06-01 21:51:31	0.333333
9	2023-06-01 21:55:05	0.000000
14	2023-06-01 22:12:48	0.000000

But I wonder if there is a more idiomatic pandas way to get the same result.
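For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample data and wraps the loop in a function. The helper name `naive_average` is hypothetical. The explicit sort is an added assumption rather than part of the original code: `groupby(...).last()` returns the last row in frame order, so the loop is only correct when each company's rows are in chronological order.

```python
import pandas as pd

# Sample data copied from the table in the question.
rows = [
    ("2023-06-01 10:46:11", -1, "A", 1000),
    ("2023-06-01 11:12:12", 1, "A", 1000),
    ("2023-06-01 15:52:44", 0, "A", 1000),
    ("2023-06-01 18:24:15", 0, "A", 1000),
    ("2023-06-01 19:19:58", 1, "A", 1000),
    ("2023-06-01 07:00:41", 0, "B", 2000),
    ("2023-06-01 09:44:46", -1, "B", 2000),
    ("2023-06-01 15:06:21", 1, "B", 2000),
    ("2023-06-01 15:32:35", 0, "B", 2000),
    ("2023-06-01 21:55:05", -1, "B", 2000),
    ("2023-06-01 08:20:33", 0, "C", 3000),
    ("2023-06-01 15:02:17", -1, "C", 3000),
    ("2023-06-01 17:09:25", 1, "C", 3000),
    ("2023-06-01 21:51:31", 0, "C", 3000),
    ("2023-06-01 22:12:48", 0, "C", 3000),
]
df = pd.DataFrame(rows, columns=["timestamp", "value", "name", "nusers"])
df["timestamp"] = pd.to_datetime(df["timestamp"])


def naive_average(frame):
    """Loop-based baseline: mean of each company's latest value at every event time."""
    # Sort first so that .last() really is the most recent event per company.
    frame = frame.sort_values("timestamp")
    res = []
    for row in frame.itertuples(index=False):
        recent = frame[frame["timestamp"] <= row.timestamp]
        latest_values = recent.groupby("name")["value"].last()
        res.append({"timestamp": row.timestamp, "value": latest_values.mean()})
    return pd.DataFrame(res)


aggregated_df = naive_average(df)
print(aggregated_df)
```

Each iteration filters and groups the whole frame again, so the cost grows roughly quadratically with the number of events, which is the main reason to look for a vectorized alternative.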

Answer 1

Score: 3

The following works.

df.sort_values("timestamp").set_index(["timestamp", "name"])\
    ["value"].unstack().ffill().mean(axis=1)
#Out[]: 
#timestamp
#2023-06-01 07:00:41    0.000000
#2023-06-01 08:20:33    0.000000
#2023-06-01 09:44:46   -0.500000
#2023-06-01 10:46:11   -0.666667
#2023-06-01 11:12:12    0.000000
#2023-06-01 15:02:17   -0.333333
#2023-06-01 15:06:21    0.333333
#2023-06-01 15:32:35    0.000000
#2023-06-01 15:52:44   -0.333333
#2023-06-01 17:09:25    0.333333
#2023-06-01 18:24:15    0.333333
#2023-06-01 19:19:58    0.666667
#2023-06-01 21:51:31    0.333333
#2023-06-01 21:55:05    0.000000
#2023-06-01 22:12:48    0.000000
#dtype: float64

Components:

Sort the rows so they are in chronological order
Set timestamp and name as the index
Keep only the value column
Unstack, so that each name becomes its own column
Forward fill, so that every timestamp row carries the latest value seen so far for each name
Take the mean of each row (that is, per timestamp)
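The steps above can be sketched one at a time, which makes the intermediate frames easy to inspect. The variable names (`values`, `wide`, `filled`, `result`) are illustrative only. The `pivot_table` variant at the end is an assumption added here, not part of the original answer: `unstack()` raises an error if two rows share the same (timestamp, name) pair, and `aggfunc="last"` is one possible way to handle that case.

```python
import pandas as pd

# Sample data copied from the table in the question.
rows = [
    ("2023-06-01 10:46:11", -1, "A", 1000),
    ("2023-06-01 11:12:12", 1, "A", 1000),
    ("2023-06-01 15:52:44", 0, "A", 1000),
    ("2023-06-01 18:24:15", 0, "A", 1000),
    ("2023-06-01 19:19:58", 1, "A", 1000),
    ("2023-06-01 07:00:41", 0, "B", 2000),
    ("2023-06-01 09:44:46", -1, "B", 2000),
    ("2023-06-01 15:06:21", 1, "B", 2000),
    ("2023-06-01 15:32:35", 0, "B", 2000),
    ("2023-06-01 21:55:05", -1, "B", 2000),
    ("2023-06-01 08:20:33", 0, "C", 3000),
    ("2023-06-01 15:02:17", -1, "C", 3000),
    ("2023-06-01 17:09:25", 1, "C", 3000),
    ("2023-06-01 21:51:31", 0, "C", 3000),
    ("2023-06-01 22:12:48", 0, "C", 3000),
]
df = pd.DataFrame(rows, columns=["timestamp", "value", "name", "nusers"])
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Steps 1-3: chronological order, (timestamp, name) index, value column only.
values = df.sort_values("timestamp").set_index(["timestamp", "name"])["value"]

# Step 4: one column per name; NaN where a company has no event at that timestamp.
wide = values.unstack()

# Step 5: carry each company's latest value forward in time.
filled = wide.ffill()

# Step 6: row-wise mean; NaN cells (company not seen yet) are skipped by default.
result = filled.mean(axis=1)

# Assumed variant for data with duplicate (timestamp, name) pairs:
# keep the last value per pair instead of raising.
wide_alt = df.pivot_table(index="timestamp", columns="name",
                          values="value", aggfunc="last")
result_alt = wide_alt.ffill().mean(axis=1)

print(filled.head())
print(result.head())
```

Because `mean` skips NaN by default, the earliest rows average only the companies that have already reported, which matches what the loop in the question does.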
