How to calculate the average value of the most recent events across groups in a pandas DataFrame?

Question

I have a pandas DataFrame of events (timestamp, value, company ID, etc.).

Example:

                timestamp  value name  nusers
    0 2023-06-01 10:46:11     -1    A    1000
    1 2023-06-01 11:12:12      1    A    1000
    2 2023-06-01 15:52:44      0    A    1000
    3 2023-06-01 18:24:15      0    A    1000
    4 2023-06-01 19:19:58      1    A    1000
    0 2023-06-01 07:00:41      0    B    2000
    1 2023-06-01 09:44:46     -1    B    2000
    2 2023-06-01 15:06:21      1    B    2000
    3 2023-06-01 15:32:35      0    B    2000
    4 2023-06-01 21:55:05     -1    B    2000
    0 2023-06-01 08:20:33      0    C    3000
    1 2023-06-01 15:02:17     -1    C    3000
    2 2023-06-01 17:09:25      1    C    3000
    3 2023-06-01 21:51:31      0    C    3000
    4 2023-06-01 22:12:48      0    C    3000
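
For anyone who wants to reproduce this, the sample can be rebuilt roughly as follows. This is only a sketch: the column names are taken from the table header, the timestamps are assumed to be parsed as datetimes, and the default 0..14 index is used instead of the repeated 0..4 index shown above.

```python
import pandas as pd

# Sample events copied from the table above: (timestamp, value, name, nusers)
rows = [
    ("2023-06-01 10:46:11", -1, "A", 1000),
    ("2023-06-01 11:12:12", 1, "A", 1000),
    ("2023-06-01 15:52:44", 0, "A", 1000),
    ("2023-06-01 18:24:15", 0, "A", 1000),
    ("2023-06-01 19:19:58", 1, "A", 1000),
    ("2023-06-01 07:00:41", 0, "B", 2000),
    ("2023-06-01 09:44:46", -1, "B", 2000),
    ("2023-06-01 15:06:21", 1, "B", 2000),
    ("2023-06-01 15:32:35", 0, "B", 2000),
    ("2023-06-01 21:55:05", -1, "B", 2000),
    ("2023-06-01 08:20:33", 0, "C", 3000),
    ("2023-06-01 15:02:17", -1, "C", 3000),
    ("2023-06-01 17:09:25", 1, "C", 3000),
    ("2023-06-01 21:51:31", 0, "C", 3000),
    ("2023-06-01 22:12:48", 0, "C", 3000),
]

df = pd.DataFrame(rows, columns=["timestamp", "value", "name", "nusers"])

# Parse the strings so that timestamp comparisons and sorting are chronological
df["timestamp"] = pd.to_datetime(df["timestamp"])
```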

For each event, I need to calculate the average of the most recent values across companies at that moment in time. The most straightforward way would be to loop through all rows, take the most recent event for each company at or before the current timestamp, and calculate the average.

For the DataFrame above, the 'naive' code that works looks like this:

    import pandas as pd

    res = []
    for index, row in df.iterrows():
        # All events at or before the current event's timestamp
        recent = df[df.timestamp <= row.timestamp]
        # Last row per company (relies on each company's rows being in chronological order)
        latest_values = recent.groupby('name').last()
        res.append(dict(timestamp=row.timestamp, value=latest_values.value.mean()))
    aggregated_df = pd.DataFrame(res)
    aggregated_df.sort_values('timestamp', inplace=True)
    aggregated_df

This produces what I need:

                  timestamp     value
    5   2023-06-01 07:00:41  0.000000
    10  2023-06-01 08:20:33  0.000000
    6   2023-06-01 09:44:46 -0.500000
    0   2023-06-01 10:46:11 -0.666667
    1   2023-06-01 11:12:12  0.000000
    11  2023-06-01 15:02:17 -0.333333
    7   2023-06-01 15:06:21  0.333333
    8   2023-06-01 15:32:35  0.000000
    2   2023-06-01 15:52:44 -0.333333
    12  2023-06-01 17:09:25  0.333333
    3   2023-06-01 18:24:15  0.333333
    4   2023-06-01 19:19:58  0.666667
    13  2023-06-01 21:51:31  0.333333
    9   2023-06-01 21:55:05  0.000000
    14  2023-06-01 22:12:48  0.000000

But I wonder if there is a more pandas-like way to get the same result.

Answer 1

Score: 3

The following works.

    (
        df.sort_values("timestamp")
        .set_index(["timestamp", "name"])["value"]
        .unstack()
        .ffill()
        .mean(axis=1)
    )
    # Out[]:
    # timestamp
    # 2023-06-01 07:00:41    0.000000
    # 2023-06-01 08:20:33    0.000000
    # 2023-06-01 09:44:46   -0.500000
    # 2023-06-01 10:46:11   -0.666667
    # 2023-06-01 11:12:12    0.000000
    # 2023-06-01 15:02:17   -0.333333
    # 2023-06-01 15:06:21    0.333333
    # 2023-06-01 15:32:35    0.000000
    # 2023-06-01 15:52:44   -0.333333
    # 2023-06-01 17:09:25    0.333333
    # 2023-06-01 18:24:15    0.333333
    # 2023-06-01 19:19:58    0.666667
    # 2023-06-01 21:51:31    0.333333
    # 2023-06-01 21:55:05    0.000000
    # 2023-06-01 22:12:48    0.000000
    # dtype: float64

Components:

  1. Sort the rows into chronological order
  2. Set the index to timestamp and name
  3. Take only the value column
  4. Unstack, so that each name becomes its own column
  5. Forward fill, so that every timestamp row holds the last known value for each name
  6. Take the mean of each row (timestamp)
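
To see what each step produces, the chain can be split into intermediate variables. The sketch below uses a small made-up data set (two companies, four events), not the data from the question, and the names `values`, `wide`, `filled` and `result` are only illustrative.

```python
import pandas as pd

# Hypothetical miniature data set: two companies, four events
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2023-06-01 07:00",
                "2023-06-01 08:00",
                "2023-06-01 09:00",
                "2023-06-01 10:00",
            ]
        ),
        "value": [0, -1, 1, 1],
        "name": ["B", "A", "B", "A"],
    }
)

# Steps 1-3: sort chronologically, index by (timestamp, name), keep only "value"
values = df.sort_values("timestamp").set_index(["timestamp", "name"])["value"]

# Step 4: one column per company; NaN where a company has no event at that timestamp
wide = values.unstack()

# Step 5: carry each company's last known value forward in time
filled = wide.ffill()

# Step 6: row-wise mean; companies not seen yet are still NaN and are skipped by default
result = filled.mean(axis=1)
print(result)
```

For this toy input the result is 0.0, -0.5, 0.0 and 1.0 for the four timestamps. One thing to keep in mind: `unstack` raises a ValueError if the same (timestamp, name) pair appears more than once, so duplicate events would have to be dropped or aggregated before this step.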

huangapple
  • Posted on May 25, 2023 at 18:53:44
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76331512.html