How to calculate the average value of the most recent events across groups in a pandas DataFrame?
Question
I have a pandas dataframe with events (timestamp, value, company ID, etc.).
EXAMPLE:
timestamp value name nusers
0 2023-06-01 10:46:11 -1 A 1000
1 2023-06-01 11:12:12 1 A 1000
2 2023-06-01 15:52:44 0 A 1000
3 2023-06-01 18:24:15 0 A 1000
4 2023-06-01 19:19:58 1 A 1000
0 2023-06-01 07:00:41 0 B 2000
1 2023-06-01 09:44:46 -1 B 2000
2 2023-06-01 15:06:21 1 B 2000
3 2023-06-01 15:32:35 0 B 2000
4 2023-06-01 21:55:05 -1 B 2000
0 2023-06-01 08:20:33 0 C 3000
1 2023-06-01 15:02:17 -1 C 3000
2 2023-06-01 17:09:25 1 C 3000
3 2023-06-01 21:51:31 0 C 3000
4 2023-06-01 22:12:48 0 C 3000
For each event, I need to calculate the average of the most recent values across companies at that moment in time. The most straightforward way is to loop through all rows, take the most recent event for each company at or before the current timestamp, and average those values.
For the dataframe above, the 'naive' code that works looks like this:
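import pandas as pd

res = []
for index, row in df.iterrows():
    # All events at or before the current event's timestamp
    recent = df[df.timestamp <= row.timestamp]
    # Last event per company (rows are in chronological order within each name)
    latest_values = recent.groupby('name').last()
    res.append(dict(timestamp=row.timestamp, value=latest_values.value.mean()))
aggregated_df = pd.DataFrame(res)
aggregated_df.sort_values('timestamp', inplace=True)
aggregated_df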
This produces the result I need:
timestamp value
5 2023-06-01 07:00:41 0.000000
10 2023-06-01 08:20:33 0.000000
6 2023-06-01 09:44:46 -0.500000
0 2023-06-01 10:46:11 -0.666667
1 2023-06-01 11:12:12 0.000000
11 2023-06-01 15:02:17 -0.333333
7 2023-06-01 15:06:21 0.333333
8 2023-06-01 15:32:35 0.000000
2 2023-06-01 15:52:44 -0.333333
12 2023-06-01 17:09:25 0.333333
3 2023-06-01 18:24:15 0.333333
4 2023-06-01 19:19:58 0.666667
13 2023-06-01 21:51:31 0.333333
9 2023-06-01 21:55:05 0.000000
14 2023-06-01 22:12:48 0.000000
But I wonder whether there is a more idiomatic pandas way to get the same result.
Answer 1
Score: 3
The following works.
df.sort_values("timestamp").set_index(["timestamp", "name"])\
["value"].unstack().ffill().mean(axis=1)
#Out[]:
#timestamp
#2023-06-01 07:00:41 0.000000
#2023-06-01 08:20:33 0.000000
#2023-06-01 09:44:46 -0.500000
#2023-06-01 10:46:11 -0.666667
#2023-06-01 11:12:12 0.000000
#2023-06-01 15:02:17 -0.333333
#2023-06-01 15:06:21 0.333333
#2023-06-01 15:32:35 0.000000
#2023-06-01 15:52:44 -0.333333
#2023-06-01 17:09:25 0.333333
#2023-06-01 18:24:15 0.333333
#2023-06-01 19:19:58 0.666667
#2023-06-01 21:51:31 0.333333
#2023-06-01 21:55:05 0.000000
#2023-06-01 22:12:48 0.000000
#dtype: float64
Components:
- Sort the values so that they are in chronological order
- Set the index to timestamp and name
- Take the value column only
- Then unstack (so that each name becomes its own column)
- Forward fill, so that each row holds the last value seen for each name up to that timestamp
- Take the mean of each row (timestamp)
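To make the steps above easier to check, here is a minimal sketch that rebuilds the sample data from the question, runs the chain one stage at a time, and compares it with the loop from the question. The variable names (`rows`, `wide`, `vectorized`, `naive`) are only illustrative, and the `nusers` column is left out on the assumption that it plays no part in the calculation.

```python
import pandas as pd

# Sample events copied from the question (nusers omitted: it is not used here).
rows = [
    ("2023-06-01 10:46:11", -1, "A"), ("2023-06-01 11:12:12", 1, "A"),
    ("2023-06-01 15:52:44", 0, "A"), ("2023-06-01 18:24:15", 0, "A"),
    ("2023-06-01 19:19:58", 1, "A"),
    ("2023-06-01 07:00:41", 0, "B"), ("2023-06-01 09:44:46", -1, "B"),
    ("2023-06-01 15:06:21", 1, "B"), ("2023-06-01 15:32:35", 0, "B"),
    ("2023-06-01 21:55:05", -1, "B"),
    ("2023-06-01 08:20:33", 0, "C"), ("2023-06-01 15:02:17", -1, "C"),
    ("2023-06-01 17:09:25", 1, "C"), ("2023-06-01 21:51:31", 0, "C"),
    ("2023-06-01 22:12:48", 0, "C"),
]
df = pd.DataFrame(rows, columns=["timestamp", "value", "name"])
df["timestamp"] = pd.to_datetime(df["timestamp"])

# One row per timestamp, one column per company. A cell is NaN when that
# company has no event at exactly that timestamp.
wide = (
    df.sort_values("timestamp")
      .set_index(["timestamp", "name"])["value"]
      .unstack()
)

# Carry each company's latest value forward, then average across companies.
# mean() skips NaN by default, so companies with no event yet are ignored.
vectorized = wide.ffill().mean(axis=1)

# Reference result computed the same way as the loop in the question.
naive = pd.Series(
    {
        ts: df[df.timestamp <= ts].groupby("name")["value"].last().mean()
        for ts in df["timestamp"]
    }
).sort_index()

# True when both approaches agree on every timestamp.
same = bool((abs(vectorized.to_numpy() - naive.to_numpy()) < 1e-9).all())
print(vectorized)
print("matches the loop:", same)
```

One thing to watch for: `unstack` raises a `ValueError` if the same name has two events with an identical timestamp, so such data would need deduplicating first, for example with `drop_duplicates(["timestamp", "name"], keep="last")`. The wide table also has one column per company, so memory use grows with the number of companies times the number of distinct timestamps.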