group_rolling per group in polars

huangapple go评论82阅读模式
英文:

group_rolling per group in polars

问题

以下是翻译好的代码部分:

  1. 让我们假设我有以下的 DataFrame
  2. example_df = pl.DataFrame(
  3. {
  4. "date": pl.date_range(low=date(2022, 1, 1), high=date(2022, 1, 30), interval="1d"),
  5. "group": ["A", "B", "C"]*10,
  6. 'total': np.random.randint(100, size=30)
  7. }
  8. )
  9. 现在我想在5天的时间窗口内对每个组的总和进行求和类似于将 groupby groupby_rolling 结合在一起
  10. 我想知道是否有比以下代码更加优雅和高效的方法来实现
  11. df_list = []
  12. for table_group in example_df['group'].unique():
  13. temp_df = (
  14. example_df
  15. .filter(pl.col('group')==table_group)
  16. .sort('date')
  17. .groupby_rolling('date', period='5d')
  18. .agg(pl.col('total').sum().prefix('roll_'))
  19. )
  20. temp_df2 = (
  21. example_df
  22. .filter(pl.col('group')==table_group)
  23. .sort('date')
  24. .select([
  25. pl.all(),
  26. temp_df.get_column('roll_total')
  27. ])
  28. )
  29. df_list.append(temp_df2)
  30. pl.concat(df_list)

希望这有所帮助。

英文:

Let's say I have the following df:

  1. example_df = pl.DataFrame(
  2. {
  3. "date": pl.date_range(low=date(2022, 1, 1), high=date(2022, 1, 30), interval="1d"),
  4. "group": ["A", "B", "C"]*10,
  5. 'total': np.random.randint(100, size=30)
  6. }
  7. )

Now I want to sum the total per group, in a time window of 5 days. Something like combining groupby with groupby_rolling.

I wonder if there more elegant and efficient way to do it than the following code:

  1. df_list = []
  2. for table_group in example_df['group'].unique():
  3. temp_df = (
  4. example_df
  5. .filter(pl.col('group')==table_group)
  6. .sort('date')
  7. .groupby_rolling('date', period='5d')
  8. .agg(pl.col('total').sum().prefix('roll_'))
  9. )
  10. temp_df2 = (
  11. example_df
  12. .filter(pl.col('group')==table_group)
  13. .sort('date')
  14. .select([
  15. pl.all(),
  16. temp_df.get_column('roll_total')
  17. ])
  18. )
  19. df_list.append(temp_df2)
  20. pl.concat(df_list)

My main issue is that I have millions of rows and over 10,000 groups(and more than 1 column that I want to sum)...
So this code is not so elegant and consumes more memory than I want it to.

答案1

得分: 1

以下是您要翻译的内容:

"It appears that the piece you're missing is the by parameter.

This allows you to specify grouping.

  1. (df.groupby_rolling(index_column="date", by="group", period="5d")
  2. .agg(
  3. pl.col("total").last(),
  4. roll_total = pl.sum("total")))
  1. shape: (30, 4)
  2. ┌───────┬────────────┬───────┬────────────┐
  3. group date total roll_total
  4. --- --- --- ---
  5. str date i64 i64
  6. ╞═══════╪════════════╪═══════╪════════════╡
  7. A 2022-01-01 96 96
  8. A 2022-01-04 16 112
  9. A 2022-01-07 17 33
  10. A 2022-01-10 4 21
  11. C 2022-01-21 17 78
  12. C 2022-01-24 84 101
  13. C 2022-01-27 78 162
  14. C 2022-01-30 54 132
  15. └───────┴────────────┴───────┴────────────┘
  16. ```"
  17. <details>
  18. <summary>英文:</summary>
  19. It appears that the piece you&#39;re missing is the [`by`](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.groupby_rolling.html) parameter.
  20. This allows you to specify grouping.

(df.groupby_rolling(index_column="date", by="group", period="5d")
.agg(
pl.col("total").last(),
roll_total = pl.sum("total")))

shape: (30, 4)
┌───────┬────────────┬───────┬────────────┐
│ group ┆ date ┆ total ┆ roll_total │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ i64 ┆ i64 │
╞═══════╪════════════╪═══════╪════════════╡
│ A ┆ 2022-01-01 ┆ 96 ┆ 96 │
│ A ┆ 2022-01-04 ┆ 16 ┆ 112 │
│ A ┆ 2022-01-07 ┆ 17 ┆ 33 │
│ A ┆ 2022-01-10 ┆ 4 ┆ 21 │
│ … ┆ … ┆ … ┆ … │
│ C ┆ 2022-01-21 ┆ 17 ┆ 78 │
│ C ┆ 2022-01-24 ┆ 84 ┆ 101 │
│ C ┆ 2022-01-27 ┆ 78 ┆ 162 │
│ C ┆ 2022-01-30 ┆ 54 ┆ 132 │
└───────┴────────────┴───────┴────────────┘

  1. </details>

huangapple
  • 本文由 发表于 2023年3月21日 02:25:17
  • 转载请务必保留本文链接:https://go.coder-hub.com/75793994.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定