2023年3月21日 02:25:17go评论86阅读模式

英文:

group_rolling per group in polars

问题

以下是翻译好的代码部分：

让我们假设我有以下的 DataFrame：
example_df = pl.DataFrame(
    {
    "date": pl.date_range(low=date(2022, 1, 1), high=date(2022, 1, 30), interval="1d"),
    "group": ["A", "B", "C"]*10,
    'total': np.random.randint(100, size=30)
    }
)
现在，我想在5天的时间窗口内对每个组的总和进行求和。类似于将 groupby 与 groupby_rolling 结合在一起。
我想知道是否有比以下代码更加优雅和高效的方法来实现：
df_list = []
for table_group in example_df['group'].unique():
    temp_df = (
        example_df
        .filter(pl.col('group')==table_group)
        .sort('date')
        .groupby_rolling('date', period='5d')
        .agg(pl.col('total').sum().prefix('roll_'))
    )
    
    temp_df2 = (
        example_df
        .filter(pl.col('group')==table_group)
        .sort('date')
        .select([
            pl.all(),
            temp_df.get_column('roll_total')
            ])
    )
    df_list.append(temp_df2)
pl.concat(df_list)

希望这有所帮助。

英文:

Let's say I have the following df:

example_df = pl.DataFrame(
    {
    &quot;date&quot;: pl.date_range(low=date(2022, 1, 1), high=date(2022, 1, 30), interval=&quot;1d&quot;),
    &quot;group&quot;: [&quot;A&quot;, &quot;B&quot;, &quot;C&quot;]*10,
    &#39;total&#39;: np.random.randint(100, size=30)
    }
)

Now I want to sum the total per group, in a time window of 5 days. Something like combining groupby with groupby_rolling.

I wonder if there more elegant and efficient way to do it than the following code:

df_list = []
for table_group in example_df[&#39;group&#39;].unique():
    temp_df = (
        example_df
        .filter(pl.col(&#39;group&#39;)==table_group)
        .sort(&#39;date&#39;)
        .groupby_rolling(&#39;date&#39;, period=&#39;5d&#39;)
        .agg(pl.col(&#39;total&#39;).sum().prefix(&#39;roll_&#39;))
    )
    
    temp_df2 = (
        example_df
        .filter(pl.col(&#39;group&#39;)==table_group)
        .sort(&#39;date&#39;)
        .select([
            pl.all(),
            temp_df.get_column(&#39;roll_total&#39;)
            ])
    )
    df_list.append(temp_df2)
pl.concat(df_list)

My main issue is that I have millions of rows and over 10,000 groups(and more than 1 column that I want to sum)...
So this code is not so elegant and consumes more memory than I want it to.

答案1

得分: 1

以下是您要翻译的内容：

"It appears that the piece you're missing is the by parameter.

This allows you to specify grouping.

(df.groupby_rolling(index_column="date", by="group", period="5d")
   .agg(
      pl.col("total").last(),
      roll_total = pl.sum("total")))

shape: (30, 4)
┌───────┬────────────┬───────┬────────────┐
│ group ┆ date       ┆ total ┆ roll_total │
│ ---   ┆ ---        ┆ ---   ┆ ---        │
│ str   ┆ date       ┆ i64   ┆ i64        │
╞═══════╪════════════╪═══════╪════════════╡
│ A     ┆ 2022-01-01 ┆ 96    ┆ 96         │
│ A     ┆ 2022-01-04 ┆ 16    ┆ 112        │
│ A     ┆ 2022-01-07 ┆ 17    ┆ 33         │
│ A     ┆ 2022-01-10 ┆ 4     ┆ 21         │
│ …     ┆ …          ┆ …     ┆ …          │
│ C     ┆ 2022-01-21 ┆ 17    ┆ 78         │
│ C     ┆ 2022-01-24 ┆ 84    ┆ 101        │
│ C     ┆ 2022-01-27 ┆ 78    ┆ 162        │
│ C     ┆ 2022-01-30 ┆ 54    ┆ 132        │
└───────┴────────────┴───────┴────────────┘
```"
<details>
<summary>英文:</summary>
It appears that the piece you&#39;re missing is the [`by`](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.groupby_rolling.html) parameter.
This allows you to specify grouping.

(df.groupby_rolling(index_column="date", by="group", period="5d")
.agg(
pl.col("total").last(),
roll_total = pl.sum("total")))

shape: (30, 4)
┌───────┬────────────┬───────┬────────────┐
│ group ┆ date ┆ total ┆ roll_total │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ i64 ┆ i64 │
╞═══════╪════════════╪═══════╪════════════╡
│ A ┆ 2022-01-01 ┆ 96 ┆ 96 │
│ A ┆ 2022-01-04 ┆ 16 ┆ 112 │
│ A ┆ 2022-01-07 ┆ 17 ┆ 33 │
│ A ┆ 2022-01-10 ┆ 4 ┆ 21 │
│ … ┆ … ┆ … ┆ … │
│ C ┆ 2022-01-21 ┆ 17 ┆ 78 │
│ C ┆ 2022-01-24 ┆ 84 ┆ 101 │
│ C ┆ 2022-01-27 ┆ 78 ┆ 162 │
│ C ┆ 2022-01-30 ┆ 54 ┆ 132 │
└───────┴────────────┴───────┴────────────┘


</details>

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

group_rolling per group in polars

问题

答案1

Polars分组洗牌和拆分数据框。

polars.date_range X delta from high or low

数据类型无法写入CSV。

SQL查询以获取从“Marks”表中每个学期中分数连续增加的学生详细信息。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。