英文:
group_rolling per group in polars
问题
以下是翻译好的代码部分:
让我们假设我有以下的 DataFrame:
example_df = pl.DataFrame(
{
"date": pl.date_range(low=date(2022, 1, 1), high=date(2022, 1, 30), interval="1d"),
"group": ["A", "B", "C"]*10,
'total': np.random.randint(100, size=30)
}
)
现在,我想在5天的时间窗口内对每个组的总和进行求和。类似于将 groupby 与 groupby_rolling 结合在一起。
我想知道是否有比以下代码更加优雅和高效的方法来实现:
df_list = []
for table_group in example_df['group'].unique():
temp_df = (
example_df
.filter(pl.col('group')==table_group)
.sort('date')
.groupby_rolling('date', period='5d')
.agg(pl.col('total').sum().prefix('roll_'))
)
temp_df2 = (
example_df
.filter(pl.col('group')==table_group)
.sort('date')
.select([
pl.all(),
temp_df.get_column('roll_total')
])
)
df_list.append(temp_df2)
pl.concat(df_list)
希望这有所帮助。
英文:
Let's say I have the following df:
example_df = pl.DataFrame(
{
"date": pl.date_range(low=date(2022, 1, 1), high=date(2022, 1, 30), interval="1d"),
"group": ["A", "B", "C"]*10,
'total': np.random.randint(100, size=30)
}
)
Now I want to sum the total per group, in a time window of 5 days. Something like combining groupby with groupby_rolling.
I wonder if there more elegant and efficient way to do it than the following code:
df_list = []
for table_group in example_df['group'].unique():
temp_df = (
example_df
.filter(pl.col('group')==table_group)
.sort('date')
.groupby_rolling('date', period='5d')
.agg(pl.col('total').sum().prefix('roll_'))
)
temp_df2 = (
example_df
.filter(pl.col('group')==table_group)
.sort('date')
.select([
pl.all(),
temp_df.get_column('roll_total')
])
)
df_list.append(temp_df2)
pl.concat(df_list)
My main issue is that I have millions of rows and over 10,000 groups(and more than 1 column that I want to sum)...
So this code is not so elegant and consumes more memory than I want it to.
答案1
得分: 1
以下是您要翻译的内容:
"It appears that the piece you're missing is the by
parameter.
This allows you to specify grouping.
(df.groupby_rolling(index_column="date", by="group", period="5d")
.agg(
pl.col("total").last(),
roll_total = pl.sum("total")))
shape: (30, 4)
┌───────┬────────────┬───────┬────────────┐
│ group ┆ date ┆ total ┆ roll_total │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ i64 ┆ i64 │
╞═══════╪════════════╪═══════╪════════════╡
│ A ┆ 2022-01-01 ┆ 96 ┆ 96 │
│ A ┆ 2022-01-04 ┆ 16 ┆ 112 │
│ A ┆ 2022-01-07 ┆ 17 ┆ 33 │
│ A ┆ 2022-01-10 ┆ 4 ┆ 21 │
│ … ┆ … ┆ … ┆ … │
│ C ┆ 2022-01-21 ┆ 17 ┆ 78 │
│ C ┆ 2022-01-24 ┆ 84 ┆ 101 │
│ C ┆ 2022-01-27 ┆ 78 ┆ 162 │
│ C ┆ 2022-01-30 ┆ 54 ┆ 132 │
└───────┴────────────┴───────┴────────────┘
```"
<details>
<summary>英文:</summary>
It appears that the piece you're missing is the [`by`](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.groupby_rolling.html) parameter.
This allows you to specify grouping.
(df.groupby_rolling(index_column="date", by="group", period="5d")
.agg(
pl.col("total").last(),
roll_total = pl.sum("total")))
shape: (30, 4)
┌───────┬────────────┬───────┬────────────┐
│ group ┆ date ┆ total ┆ roll_total │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ i64 ┆ i64 │
╞═══════╪════════════╪═══════╪════════════╡
│ A ┆ 2022-01-01 ┆ 96 ┆ 96 │
│ A ┆ 2022-01-04 ┆ 16 ┆ 112 │
│ A ┆ 2022-01-07 ┆ 17 ┆ 33 │
│ A ┆ 2022-01-10 ┆ 4 ┆ 21 │
│ … ┆ … ┆ … ┆ … │
│ C ┆ 2022-01-21 ┆ 17 ┆ 78 │
│ C ┆ 2022-01-24 ┆ 84 ┆ 101 │
│ C ┆ 2022-01-27 ┆ 78 ┆ 162 │
│ C ┆ 2022-01-30 ┆ 54 ┆ 132 │
└───────┴────────────┴───────┴────────────┘
</details>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论