Polars cumulative sum over consecutive groups

Question

I have a DataFrame like so:

| Date       | Group | Value |
|------------|-------|-------|
| 2020-01-01 | 0     | 5     |
| 2020-01-02 | 0     | 8     |
| 2020-01-03 | 0     | 9     |
| 2020-01-01 | 1     | 5     |
| 2020-01-02 | 1     | -1    |
| 2020-01-03 | 1     | 2     |
| 2020-01-01 | 2     | -2    |
| 2020-01-02 | 2     | -1    |
| 2020-01-03 | 2     | 7     |

I want a cumulative sum of "Value" within each "Date", accumulated in "Group" order, something like:

| Date       | Group | Value            |
|------------|-------|------------------|
| 2020-01-01 | 0     | 5                |
| 2020-01-02 | 0     | 8                |
| 2020-01-03 | 0     | 9                |
| 2020-01-01 | 1     | 10 (= 5 + 5)     |
| 2020-01-02 | 1     | 7  (= 8 - 1)     |
| 2020-01-03 | 1     | 11 (= 9 + 2)     |
| 2020-01-01 | 2     | 8  (= 5 + 5 - 2) |
| 2020-01-02 | 2     | 6  (= 8 - 1 - 1) |
| 2020-01-03 | 2     | 18 (= 9 + 2 + 7) |

The explanation for these values is as follows. Group 0 precedes group 1, and group 1 precedes group 2. For group 0 we need not do anything: the cumulative sum up to this group is just the original values. For group 1, we add the group 0 values for each date. Similarly, for group 2, we add the values of group 1 and group 0.
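For reference, the arithmetic described above can be spelled out in plain Python. This is only a check of the expected numbers, not the loop-free Polars solution being asked for; the names `rows`, `running`, and `expected` are made up for illustration.

```python
from collections import defaultdict

# Rows in the question's order: (date, group, value).
rows = [
    ("2020-01-01", 0, 5), ("2020-01-02", 0, 8), ("2020-01-03", 0, 9),
    ("2020-01-01", 1, 5), ("2020-01-02", 1, -1), ("2020-01-03", 1, 2),
    ("2020-01-01", 2, -2), ("2020-01-02", 2, -1), ("2020-01-03", 2, 7),
]

running = defaultdict(int)  # running total per date
expected = []
# Visit rows in ascending group order so that each date
# accumulates group 0 first, then group 1, then group 2.
for date, group, value in sorted(rows, key=lambda r: r[1]):
    running[date] += value
    expected.append(running[date])

print(expected)  # [5, 8, 9, 10, 7, 11, 8, 6, 18]
```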

What I have tried is to do this via a helper pivot table. I loop over the groups, sum a partial selection of the pivoted columns row-wise, and append the results to a list of new values. Then I add these new values as a column to the original DataFrame.

from io import StringIO

import polars as pl


df = pl.read_csv(StringIO("""
Date,Group,Value
2020-01-01,0,5
2020-01-02,0,8
2020-01-03,0,9
2020-01-01,1,5
2020-01-02,1,-1
2020-01-03,1,2
2020-01-01,2,-2
2020-01-02,2,-1
2020-01-03,2,7
"""), parse_dates=True)

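# Wide helper table: one row per Date, one column per Group
ddf = df.pivot('Value', 'Date', 'Group')

new_vals = []
for i in range(df['Group'].max() + 1):
    # Row-wise sum of the columns for groups 0..i
    new_vals.extend(
        ddf.select([pl.col(f'{j}') for j in range(i+1)])
           .sum(axis=1)
           .to_list()
    )

# Attach the accumulated values to the original frame
df.with_column(pl.Series(new_vals).alias('CumSumValue'))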

Is there a way to do this without loops or all this "inelegance"?

Answer 1

Score: 3

Assuming the rows are already ordered by group, you can create an index within each group and then take the cumulative sum over date and index:

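# Number the rows within each group
df = df.with_columns(pl.col("Date").cumcount().over("Group").alias("Index"))

# Running total of Value within each (Date, Index) window, in row order
df.select((
    pl.col(["Date", "Group"]),
    pl.col("Value").cumsum().over(["Date", "Index"]).alias("Value"),
))

shape: (9, 3)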
┌────────────┬───────┬───────┐
│ Date       ┆ Group ┆ Value │
│ ---        ┆ ---   ┆ ---   │
│ date       ┆ i64   ┆ i64   │
╞════════════╪═══════╪═══════╡
│ 2020-01-01 ┆ 0     ┆ 5     │
│ 2020-01-02 ┆ 0     ┆ 8     │
│ 2020-01-03 ┆ 0     ┆ 9     │
│ 2020-01-01 ┆ 1     ┆ 10    │
│ ...        ┆ ...   ┆ ...   │
│ 2020-01-03 ┆ 1     ┆ 11    │
│ 2020-01-01 ┆ 2     ┆ 8     │
│ 2020-01-02 ┆ 2     ┆ 6     │
│ 2020-01-03 ┆ 2     ┆ 18    │
└────────────┴───────┴───────┘

huangapple
  • Posted on March 7, 2023, 14:09:01
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75658529.html