How to calculate the average of the most recent events across groups in a pandas DataFrame?

Question

I have a pandas DataFrame of events (timestamp, value, company ID, etc.).

Example:

	timestamp	value	name	nusers
0	2023-06-01 10:46:11	-1	A	1000
1	2023-06-01 11:12:12	1	A	1000
2	2023-06-01 15:52:44	0	A	1000
3	2023-06-01 18:24:15	0	A	1000
4	2023-06-01 19:19:58	1	A	1000
0	2023-06-01 07:00:41	0	B	2000
1	2023-06-01 09:44:46	-1	B	2000
2	2023-06-01 15:06:21	1	B	2000
3	2023-06-01 15:32:35	0	B	2000
4	2023-06-01 21:55:05	-1	B	2000
0	2023-06-01 08:20:33	0	C	3000
1	2023-06-01 15:02:17	-1	C	3000
2	2023-06-01 17:09:25	1	C	3000
3	2023-06-01 21:51:31	0	C	3000
4	2023-06-01 22:12:48	0	C	3000

For each event, I need to calculate the average of the most recent value from each company at that moment in time. The most straightforward way is to loop through all rows, take each company's most recent event at or before the current timestamp, and average those values.

So for the DataFrame above, the 'naive' code that works looks like this:

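import pandas as pd

# df is the events frame shown above.
res = []
for index, row in df.iterrows():
    # All events at or before the current event's timestamp.
    recent = df[df.timestamp <= row.timestamp]
    # Last row per company; relies on each company's rows being in time order.
    latest_values = recent.groupby('name').last()
    res.append(dict(timestamp=row.timestamp, value=latest_values.value.mean()))

aggregated_df = pd.DataFrame(res)
aggregated_df.sort_values('timestamp', inplace=True)
aggregated_df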

which results in what I need:

	timestamp	value
5	2023-06-01 07:00:41	0.000000
10	2023-06-01 08:20:33	0.000000
6	2023-06-01 09:44:46	-0.500000
0	2023-06-01 10:46:11	-0.666667
1	2023-06-01 11:12:12	0.000000
11	2023-06-01 15:02:17	-0.333333
7	2023-06-01 15:06:21	0.333333
8	2023-06-01 15:32:35	0.000000
2	2023-06-01 15:52:44	-0.333333
12	2023-06-01 17:09:25	0.333333
3	2023-06-01 18:24:15	0.333333
4	2023-06-01 19:19:58	0.666667
13	2023-06-01 21:51:31	0.333333
9	2023-06-01 21:55:05	0.000000
14	2023-06-01 22:12:48	0.000000

But I wonder whether there is a more pandas-like way of getting the same result.

Answer 1

Score: 3

The following works.

df.sort_values("timestamp").set_index(["timestamp", "name"])\
    ["value"].unstack().ffill().mean(axis=1)

#Out[]: 
#timestamp
#2023-06-01 07:00:41    0.000000
#2023-06-01 08:20:33    0.000000
#2023-06-01 09:44:46   -0.500000
#2023-06-01 10:46:11   -0.666667
#2023-06-01 11:12:12    0.000000
#2023-06-01 15:02:17   -0.333333
#2023-06-01 15:06:21    0.333333
#2023-06-01 15:32:35    0.000000
#2023-06-01 15:52:44   -0.333333
#2023-06-01 17:09:25    0.333333
#2023-06-01 18:24:15    0.333333
#2023-06-01 19:19:58    0.666667
#2023-06-01 21:51:31    0.333333
#2023-06-01 21:55:05    0.000000
#2023-06-01 22:12:48    0.000000
#dtype: float64

Components:

  1. Sort the rows into chronological order
  2. Set timestamp and name as the index
  3. Take the value column only
  4. Unstack, so that each name becomes its own column
  5. Forward fill, so that each row holds the latest value seen so far for every name
  6. Take the mean of each row (one row per timestamp)
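
The steps above can be checked end to end with a small self-contained sketch. The sample rows are copied from the question; the function names `latest_mean_wide` and `latest_mean_naive` are made up here for illustration and do not come from the original post. It assumes that `timestamp` is parsed to datetimes and that no (timestamp, name) pair occurs twice.

```python
import pandas as pd

# Sample data copied from the question (timestamp, value, name).
rows = [
    ("2023-06-01 10:46:11", -1, "A"),
    ("2023-06-01 11:12:12", 1, "A"),
    ("2023-06-01 15:52:44", 0, "A"),
    ("2023-06-01 18:24:15", 0, "A"),
    ("2023-06-01 19:19:58", 1, "A"),
    ("2023-06-01 07:00:41", 0, "B"),
    ("2023-06-01 09:44:46", -1, "B"),
    ("2023-06-01 15:06:21", 1, "B"),
    ("2023-06-01 15:32:35", 0, "B"),
    ("2023-06-01 21:55:05", -1, "B"),
    ("2023-06-01 08:20:33", 0, "C"),
    ("2023-06-01 15:02:17", -1, "C"),
    ("2023-06-01 17:09:25", 1, "C"),
    ("2023-06-01 21:51:31", 0, "C"),
    ("2023-06-01 22:12:48", 0, "C"),
]
df = pd.DataFrame(rows, columns=["timestamp", "value", "name"])
df["timestamp"] = pd.to_datetime(df["timestamp"])


def latest_mean_wide(frame):
    """Hypothetical helper: the answer's pipeline, one column per name."""
    wide = (
        frame.sort_values("timestamp")
        .set_index(["timestamp", "name"])["value"]
        .unstack()  # rows: timestamps, columns: names, NaN where no event
        .ffill()    # carry each name's latest value forward in time
    )
    # mean skips NaN by default, so names with no event yet are ignored
    return wide.mean(axis=1)


def latest_mean_naive(frame):
    """Hypothetical helper: the loop from the question, as a reference."""
    out = {}
    for _, row in frame.iterrows():
        recent = frame[frame["timestamp"] <= row["timestamp"]]
        out[row["timestamp"]] = recent.groupby("name").last()["value"].mean()
    return pd.Series(out).sort_index()


fast = latest_mean_wide(df)
slow = latest_mean_naive(df)
print(fast.round(6))
```

On this sample both helpers should return the same 15 values. One caveat: `unstack` raises a `ValueError` when the same (timestamp, name) pair appears more than once, so real data with such duplicates would need deduplicating first, for example by keeping only the last row per pair.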

huangapple
  • Posted on 2023-05-25 18:53:44
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76331512.html