Polars cumulative sum over consecutive groups
Question
I have a DataFrame like so:
| Date | Group | Value |
|------------|-------|-------|
| 2020-01-01 | 0 | 5 |
| 2020-01-02 | 0 | 8 |
| 2020-01-03 | 0 | 9 |
| 2020-01-01 | 1 | 5 |
| 2020-01-02 | 1 | -1 |
| 2020-01-03 | 1 | 2 |
| 2020-01-01 | 2 | -2 |
| 2020-01-02 | 2 | -1 |
| 2020-01-03 | 2 | 7 |
I want to do a cumulative sum within each "Date", accumulating consecutively in the order of "Group", something like:
| Date | Group | Value |
|------------|-------|------------------|
| 2020-01-01 | 0 | 5 |
| 2020-01-02 | 0 | 8 |
| 2020-01-03 | 0 | 9 |
| 2020-01-01 | 1 | 10 (= 5 + 5) |
| 2020-01-02 | 1 | 7 (= 8 - 1) |
| 2020-01-03 | 1 | 11 (= 9 + 2) |
| 2020-01-01 | 2 | 8 (= 5 + 5 - 2) |
| 2020-01-02 | 2 | 6 (= 8 - 1 - 1) |
| 2020-01-03 | 2 | 18 (= 9 + 2 + 7) |
The explanation for these values is as follows. Group 0 precedes group 1, and group 1 precedes group 2. For the values of group 0, we need not do anything; the cumulative sums up to this group are just the original values. For the values of group 1, we accumulate the values of group 0 for each date. Similarly, for group 2, we accumulate the values of both group 1 and group 0.
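In other words: within each date, keep a running total and add each group's value to it as the groups are walked in order. A minimal plain-Python sketch of that rule, using only the sample data above, just to pin down the expected numbers:

# Reference sketch of the accumulation rule (plain Python, not Polars):
# keep one running total per date and add each group's value in Group order.
running = {}   # date -> running total over the groups seen so far
expected = []
rows = [
    ("2020-01-01", 0, 5), ("2020-01-02", 0, 8), ("2020-01-03", 0, 9),
    ("2020-01-01", 1, 5), ("2020-01-02", 1, -1), ("2020-01-03", 1, 2),
    ("2020-01-01", 2, -2), ("2020-01-02", 2, -1), ("2020-01-03", 2, 7),
]
for date, group, value in rows:
    running[date] = running.get(date, 0) + value
    expected.append(running[date])
# expected == [5, 8, 9, 10, 7, 11, 8, 6, 18]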
What I have tried is to do this via a helper pivot table. I loop over the Groups, do a row-wise sum over a growing selection of the pivoted columns, and append those values to a list of new values. Then, I put these new values into a new column in the original DataFrame.
from io import StringIO
import polars as pl
df = pl.read_csv(StringIO("""
Date,Group,Value
2020-01-01,0,5
2020-01-02,0,8
2020-01-03,0,9
2020-01-01,1,5
2020-01-02,1,-1
2020-01-03,1,2
2020-01-01,2,-2
2020-01-02,2,-1
2020-01-03,2,7
"""), parse_dates=True)
# Pivot so that each Group becomes its own column, with one row per Date.
ddf = df.pivot('Value', 'Date', 'Group')
new_vals = []
for i in range(df['Group'].max() + 1):
    # Row-wise sum over the columns of all groups up to and including group i,
    # giving the cumulative value per Date for group i.
    new_vals.extend(
        ddf.select([pl.col(f'{j}') for j in range(i + 1)])
        .sum(axis=1)
        .to_list()
    )
df.with_column(pl.Series(new_vals).alias('CumSumValue'))
Is there a way to do this without loops or all this "inelegance"?
Answer 1
Score: 3
So, assuming the rows are already ordered by Group, you can just create an index over the groups and then cumsum over date and index:
df = df.with_columns(pl.col("Date").cumcount().over("Group").alias("Index"))
df.select((
    pl.col(["Date", "Group"]),
    pl.col("Value").cumsum().over(["Date", "Index"]).alias("Value"),
))
shape: (9, 3)
┌────────────┬───────┬───────┐
│ Date ┆ Group ┆ Value │
│ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ i64 │
╞════════════╪═══════╪═══════╡
│ 2020-01-01 ┆ 0 ┆ 5 │
│ 2020-01-02 ┆ 0 ┆ 8 │
│ 2020-01-03 ┆ 0 ┆ 9 │
│ 2020-01-01 ┆ 1 ┆ 10 │
│ ... ┆ ... ┆ ... │
│ 2020-01-03 ┆ 1 ┆ 11 │
│ 2020-01-01 ┆ 2 ┆ 8 │
│ 2020-01-02 ┆ 2 ┆ 6 │
│ 2020-01-03 ┆ 2 ┆ 18 │
└────────────┴───────┴───────┘
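Note: both snippets use method names from older Polars releases. In more recent versions the cumulative expressions were renamed (cumcount → cum_count, cumsum → cum_sum) and with_column was replaced by with_columns; treat the exact version cutoff as an assumption. A sketch of the same approach with the newer names:

# Same idea with current Polars method names (assumes a reasonably recent Polars).
import polars as pl

out = (
    df.with_columns(
        # Row index within each Group; together with Date it identifies the window.
        pl.col("Date").cum_count().over("Group").alias("Index")
    )
    .select(
        pl.col("Date"),
        pl.col("Group"),
        pl.col("Value").cum_sum().over(["Date", "Index"]).alias("Value"),
    )
)
print(out)

In this particular frame the helper Index column is redundant: each group contains the same dates, so partitioning the window by "Date" alone yields the same partitions and therefore the same cumulative sums, as long as the rows stay in Group order.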