Polars cumulative sum over consecutive groups

Question

I have a DataFrame like so:

| Date       | Group | Value |
|------------|-------|-------|
| 2020-01-01 | 0     | 5     |
| 2020-01-02 | 0     | 8     |
| 2020-01-03 | 0     | 9     |
| 2020-01-01 | 1     | 5     |
| 2020-01-02 | 1     | -1    |
| 2020-01-03 | 1     | 2     |
| 2020-01-01 | 2     | -2    |
| 2020-01-02 | 2     | -1    |
| 2020-01-03 | 2     | 7     |

I want a cumulative sum grouped by "Date", accumulated consecutively in "Group" order, something like:

| Date       | Group | Value            |
|------------|-------|------------------|
| 2020-01-01 | 0     | 5                |
| 2020-01-02 | 0     | 8                |
| 2020-01-03 | 0     | 9                |
| 2020-01-01 | 1     | 10 (= 5 + 5)     |
| 2020-01-02 | 1     | 7  (= 8 - 1)     |
| 2020-01-03 | 1     | 11 (= 9 + 2)     |
| 2020-01-01 | 2     | 8  (= 5 + 5 - 2) |
| 2020-01-02 | 2     | 6  (= 8 - 1 - 1) |
| 2020-01-03 | 2     | 18 (= 9 + 2 + 7) |

The explanation for these values is as follows. Group 0 precedes group 1, and group 1 precedes group 2. For the values of group 0, we need not do anything: the cumulative sum up to this group is just the original values. For the values of group 1, we add the values of group 0 for each date. Similarly, for group 2, we add the values of group 1 and group 0.
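To pin down the target arithmetic, here is a minimal plain-Python sketch (the rows are copied from the table above; `running` is a hypothetical helper name, not from the question). Because the rows arrive in Group order, each row's cumulative value is its own value plus everything earlier groups contributed on the same date:

```python
from collections import defaultdict

rows = [
    ("2020-01-01", 0, 5), ("2020-01-02", 0, 8), ("2020-01-03", 0, 9),
    ("2020-01-01", 1, 5), ("2020-01-02", 1, -1), ("2020-01-03", 1, 2),
    ("2020-01-01", 2, -2), ("2020-01-02", 2, -1), ("2020-01-03", 2, 7),
]

# Running total per date; rows arrive in Group order, so each row's
# cumulative value is its own value plus earlier groups' values on its date.
running = defaultdict(int)
cumsum = []
for date, group, value in rows:
    running[date] += value
    cumsum.append(running[date])
# cumsum == [5, 8, 9, 10, 7, 11, 8, 6, 18]
```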

What I have tried is to do this via a helper pivot table. I do it iteratively by looping over the Groups, doing a row-wise sum over a partial selection of the columns, and appending those results to a list of new values. Then I put these new values into a column in the original DataFrame.

from io import StringIO

import polars as pl


df = pl.read_csv(StringIO("""
Date,Group,Value
2020-01-01,0,5
2020-01-02,0,8
2020-01-03,0,9
2020-01-01,1,5
2020-01-02,1,-1
2020-01-03,1,2
2020-01-01,2,-2
2020-01-02,2,-1
2020-01-03,2,7
"""), parse_dates=True)

# Pivot so each Group becomes its own column, one row per Date
ddf = df.pivot('Value', 'Date', 'Group')

# For each group i, sum the columns of groups 0..i row-wise
new_vals = []
for i in range(df['Group'].max() + 1):
    new_vals.extend(
        ddf.select([pl.col(f'{j}') for j in range(i+1)])
           .sum(axis=1)
           .to_list()
    )

df.with_column(pl.Series(new_vals).alias('CumSumValue'))

Is there a way to do this without loops or all this "inelegance"?

Answer 1

Score: 3

So assuming that the rows are ordered by Group, you can just create an index over the groups and then take a cumulative sum over date and index:

df = df.with_columns(pl.col("Date").cumcount().over("Group").alias("Index"))

df.select((
    pl.col(["Date", "Group"]),
    pl.col("Value").cumsum().over(["Date", "Index"]).alias("Value"),
))
shape: (9, 3)
┌────────────┬───────┬───────┐
│ Date       ┆ Group ┆ Value │
│ ---        ┆ ---   ┆ ---   │
│ date       ┆ i64   ┆ i64   │
╞════════════╪═══════╪═══════╡
│ 2020-01-01 ┆ 0     ┆ 5     │
│ 2020-01-02 ┆ 0     ┆ 8     │
│ 2020-01-03 ┆ 0     ┆ 9     │
│ 2020-01-01 ┆ 1     ┆ 10    │
│ ...        ┆ ...   ┆ ...   │
│ 2020-01-03 ┆ 1     ┆ 11    │
│ 2020-01-01 ┆ 2     ┆ 8     │
│ 2020-01-02 ┆ 2     ┆ 6     │
│ 2020-01-03 ┆ 2     ┆ 18    │
└────────────┴───────┴───────┘

huangapple
  • Posted on 2023-03-07 14:09:01
  • Please retain this link when reposting: https://go.coder-hub.com/75658529.html