February 27, 2023, 17:35:19

How to create a cumulative list of values, by group, in a Pandas dataframe?

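Question

I'm trying to add a new column to a DataFrame that consists of a cumulative list (by group) of another column.

For example:

import pandas as pd

df = pd.DataFrame(data={'group1': [1, 1, 2, 2, 2], 'value': [1, 2, 3, 4, 5]})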

Expected output:

   group1  value cumsum_column
0       1      1           [1]
1       1      2        [1, 2]
2       2      3           [3]
3       2      4        [3, 4]
4       2      5     [3, 4, 5]

What is the best way to accomplish this?

One approach I've tried that doesn't work:

df['value_list'] = [[i] for i in df['value']]
df['cumsum_column'] = df.groupby('group1')['value_list'].cumsum()

This raises the error:

TypeError: cumsum is not supported for object dtype

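EDIT:
To be clearer, I want to understand why this doesn't work, and I'm looking for the fastest way to do it, because I plan to use it on big dataframes.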

Answer 1

Score: 1

Use GroupBy.transform with a lambda function:

f = lambda x: [list(x)[:i] for i, y in enumerate(x, 1)]
df['cumsum_column'] = df.groupby('group1')['value'].transform(f)  
print (df)
   group1  value cumsum_column
0       1      1           [1]
1       1      2        [1, 2]
2       2      3           [3]
3       2      4        [3, 4]
4       2      5     [3, 4, 5]

Thanks to @mozway for an improved solution:

g = df.groupby('group1')['value']
d = g.agg(list)
df['cumsum_column'] = [d[k][:i] for k, grp in g for i, x in enumerate(grp, 1)]

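> I'm looking to find out why this is not working

In my opinion, the pandas developers want GroupBy.cumsum to be performant, so it only supports numeric data.

Your approach does work with Series.cumsum: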

df['value_list'] = [[i] for i in df['value']]
df['cumsum_column'] = df.groupby('group1')['value_list'].transform(lambda x: x.cumsum())
print (df)
   group1  value value_list cumsum_column
0       1      1        [1]           [1]
1       1      2        [2]        [1, 2]
2       2      3        [3]           [3]
3       2      4        [4]        [3, 4]
4       2      5        [5]     [3, 4, 5]
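
The wrap-each-value-in-a-list trick works because `+` on Python lists concatenates them, so a running sum over one-element lists yields the cumulative lists. A minimal pandas-free sketch of that idea follows; it uses itertools.accumulate as a stand-in for Series.cumsum, which is an assumption made for illustration and not a description of how pandas implements it:

```python
from itertools import accumulate
import operator

# One group's values, each wrapped in its own list (as in 'value_list').
value_list = [[3], [4], [5]]

# A running "sum" with + on lists is a running concatenation.
running = list(accumulate(value_list, operator.add))

print(running)  # [[3], [3, 4], [3, 4, 5]]
```

Each step builds a new list from the previous result, which is why the total work grows quickly with group size whichever variant is used.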

Answer 2

Score: 1

You can use a custom function in groupby.transform:

def accumulate(s):
    out = [[]]
    for x in s:
        out.append(out[-1]+[x])
    return out[1:]

df['cumsum_column'] = df.groupby('group1')['value'].transform(accumulate)

Output:

   group1  value cumsum_column
0       1      1           [1]
1       1      2        [1, 2]
2       2      3           [3]
3       2      4        [3, 4]
4       2      5     [3, 4, 5]

Why did your solution fail?

Because groupby.cumsum is restricted to numeric data (hence the "TypeError: cumsum is not supported for object dtype" error).

You would need to use a lambda in transform instead: df.groupby('group1')['value_list'].transform(lambda x: x.cumsum())

Timings:

Tested on 100k rows with 100 groups.

%%timeit
df['cumsum_column'] = df.groupby('group1')['value'].transform(accumulate)
# 199 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df['value_list'] = [[i] for i in df['value']]
df['cumsum_column'] = df.groupby('group1')['value_list'].transform(lambda x: x.cumsum())
# 207 ms ± 7.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
f = lambda x: [list(x)[:i] for i, y in enumerate(x, 1)]
df['cumsum_column'] = df.groupby('group1')['value'].transform(f)
# 6.65 s ± 483 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

### the other solution, with its logic fixed to run faster
%%timeit
g = df.groupby('group1')['value']
d = g.agg(list)
df['cumsum_column'] = [d[k][:i] for k, grp in g for i, x in enumerate(grp, start=1)]
# 207 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
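
A likely reason the slicing lambda is so much slower is that it calls list(x) once per row, converting the whole group again each time, while the fixed variant converts each group once and only slices afterwards. The pandas-free sketch below shows the two shapes side by side; the function names are hypothetical and it makes no claim about the exact timings above:

```python
def prefixes_rebuild(values):
    # Converts the whole input to a list again for every element,
    # mirroring list(x)[:i] inside the comprehension.
    return [list(values)[:i] for i, _ in enumerate(values, 1)]


def prefixes_hoisted(values):
    # Converts once, then only slices, mirroring d[k][:i] after g.agg(list).
    items = list(values)
    return [items[:i] for i in range(1, len(items) + 1)]


group = range(3, 6)  # stands in for one group's values: 3, 4, 5
assert prefixes_rebuild(group) == prefixes_hoisted(group)
print(prefixes_hoisted(group))  # [[3], [3, 4], [3, 4, 5]]
```

Both functions return the same result; only the number of full conversions differs, and that matters when the input is a pandas Series, where list(x) is comparatively expensive.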
