How to create a cumulative list of values, by group, in a Pandas dataframe?

Question


I'm trying to add a new column to the DataFrame, that consists of a cumulative list (by group) of another column.

For example:

import pandas as pd
df = pd.DataFrame(data={'group1': [1, 1, 2, 2, 2], 'value': [1, 2, 3, 4, 5]})

Expected output:

   group1  value cumsum_column
0       1      1           [1]
1       1      2        [1, 2]
2       2      3           [3]
3       2      4        [3, 4]
4       2      5     [3, 4, 5]

What is the best way to accomplish this?

One way I've tried that doesn't work:

df['value_list'] = [[i] for i in df['value']]
df['cumsum_column'] = df.groupby('group1')['value_list'].cumsum()

This throws the error:

TypeError: cumsum is not supported for object dtype

EDIT:
To be clearer: I want to understand why this does not work, and I'm looking for the fastest approach, since I plan to use it on big dataframes.

Answer 1

Score: 1


Use GroupBy.transform with a lambda function:

f = lambda x: [list(x)[:i] for i, y in enumerate(x, 1)]
df['cumsum_column'] = df.groupby('group1')['value'].transform(f)
print(df)
   group1  value cumsum_column
0       1      1           [1]
1       1      2        [1, 2]
2       2      3           [3]
3       2      4        [3, 4]
4       2      5     [3, 4, 5]
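
Note that the lambda above repeats the Series-to-list conversion, list(x), for every row of a group, which plausibly accounts for much of the 6.65 s timing reported in the other answer. A minimal sketch that converts each group only once; the helper name prefixes is hypothetical, and the result is wrapped in a Series with the group's own index so that alignment is explicit:

```python
import pandas as pd

df = pd.DataFrame({'group1': [1, 1, 2, 2, 2], 'value': [1, 2, 3, 4, 5]})

def prefixes(s):
    # Hypothetical helper: convert the group to a Python list once,
    # then take every prefix of that list by slicing.
    values = s.tolist()
    return pd.Series(
        [values[:i] for i in range(1, len(values) + 1)],
        index=s.index,
    )

df['cumsum_column'] = df.groupby('group1')['value'].transform(prefixes)
print(df)
```

Each slice is a new list, so the rows do not share mutable state.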

Thanks to @mozway for an improved solution:

g = df.groupby('group1')['value']
d = g.agg(list)
df['cumsum_column'] = [d[k][:i] for k, grp in g for i, x in enumerate(grp, 1)]
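
The comprehension above assigns the lists by position, so it appears to rely on the rows already being ordered by group1, as they are in the example. A hedged sketch for unsorted data that keeps each row's index label instead; the shuffled sample frame and the name parts are illustrative assumptions, not part of the original answer:

```python
import pandas as pd

# Same values as the example, but with the two groups interleaved.
df = pd.DataFrame({'group1': [2, 1, 2, 1, 2], 'value': [3, 1, 4, 2, 5]})

g = df.groupby('group1')['value']
d = g.agg(list)  # one full list of values per group

# Build the prefixes group by group, keeping the original row labels
# so that the assignment below aligns on the index, not on position.
parts = [
    pd.Series([d[k][:i] for i in range(1, len(grp) + 1)], index=grp.index)
    for k, grp in g
]
df['cumsum_column'] = pd.concat(parts)
print(df)
```

Sorting the frame by group1 first (with a stable sort) would be another way to make the positional version safe.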

> I'm looking to find out why this is not working

In my opinion, the pandas developers want GroupBy.cumsum to be performant, so it only supports numeric data.


Your approach does work with Series.cumsum, applied per group through transform:

df['value_list'] = [[i] for i in df['value']]
df['cumsum_column'] = df.groupby('group1')['value_list'].transform(lambda x: x.cumsum())
print (df)
   group1  value value_list cumsum_column
0       1      1        [1]           [1]
1       1      2        [2]        [1, 2]
2       2      3        [3]           [3]
3       2      4        [4]        [3, 4]
4       2      5        [5]     [3, 4, 5]

Answer 2

Score: 1


You can use a custom function in groupby.transform:

def accumulate(s):
    out = [[]]
    for x in s:
        out.append(out[-1]+[x])
    return out[1:]

df['cumsum_column'] = df.groupby('group1')['value'].transform(accumulate)

Output:

   group1  value cumsum_column
0       1      1           [1]
1       1      2        [1, 2]
2       2      3           [3]
3       2      4        [3, 4]
4       2      5     [3, 4, 5]
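
The same running concatenation can also be written with itertools.accumulate from the standard library. This is an alternative sketch rather than part of the original answer; the helper name accumulate_lists is hypothetical, and no timing is claimed for it:

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({'group1': [1, 1, 2, 2, 2], 'value': [1, 2, 3, 4, 5]})

def accumulate_lists(s):
    # Wrap every value in a one-element list; accumulate's default
    # operation is addition, which concatenates lists into new lists.
    singles = ([x] for x in s)
    return pd.Series(list(accumulate(singles)), index=s.index)

df['cumsum_column'] = df.groupby('group1')['value'].transform(accumulate_lists)
print(df)
```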

Why did your solution fail?

Because groupby.cumsum is restricted to numeric data (hence the "TypeError: cumsum is not supported for object dtype" error).

You would have needed to use a lambda in transform: df.groupby('group1')['value_list'].transform(lambda x: x.cumsum()).

Timings:

Tested on 100k rows with 100 groups.
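
The benchmark frame itself is not shown. One way such a frame might be built is sketched below; the random seed, the value range, and the choice to sort the rows by group are all assumptions made for illustration, not the setup actually used for the timings:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
n_rows, n_groups = 100_000, 100

df = pd.DataFrame({
    # Sorted group labels, so that position-based solutions stay valid.
    'group1': np.sort(rng.integers(0, n_groups, size=n_rows)),
    # Arbitrary integer payload; the range is an assumption.
    'value': rng.integers(0, 1_000, size=n_rows),
})
print(df.shape)
```

With about 1,000 rows per group, the cumulative lists average roughly 500 elements each, so memory use grows quickly with group size.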

%%timeit
df['cumsum_column'] = df.groupby('group1')['value'].transform(accumulate)
# 199 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df['value_list'] = [[i] for i in df['value']]
df['cumsum_column'] = df.groupby('group1')['value_list'].transform(lambda x: x.cumsum())
# 207 ms ± 7.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
f = lambda x: [list(x)[:i] for i, y in enumerate(x, 1)]
df['cumsum_column'] = df.groupby('group1')['value'].transform(f)
# 6.65 s ± 483 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

### the other solution's logic, fixed to run faster
%%timeit
g = df.groupby('group1')['value']
d = g.agg(list)
df['cumsum_column'] = [d[k][:i] for k, grp in g for i, x in enumerate(grp, start=1)]
# 207 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

huangapple
  • Posted on February 27, 2023, 17:35:19
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75578764.html