June 29, 2023, 11:24:10

sum rows with condition and groupby

Question

I have the following dataframe:

A	B	Percent	Groupby
2	0	10	All
2	1	5	All
2	2	6	All
2	0	20	Type A
2	1	15	Type A
2	2	8	Type A
3	0	10	All
3	1	5	All
3	2	6	All
3	3	3	All
3	0	20	Type A
3	1	15	Type A
3	2	8	Type A
3	3	11	Type A
4	0	10	All
4	1	5	All
4	2	6	All
4	3	3	All
4	4	1	All
4	0	20	Type A
4	1	15	Type A
4	2	8	Type A
4	3	11	Type A
4	4	2	Type A

I would like to get this result:

A	B	Percent	Groupby	sum
2	0	10	All	10
2	1	5	All	11
2	2	6	All	11
2	0	20	Type A	20
2	1	15	Type A	23
2	2	8	Type A	23
3	0	10	All	15
3	1	5	All	15
3	2	6	All	9
3	3	3	All	9
3	0	20	Type A	35
3	1	15	Type A	35
3	2	8	Type A	19
3	3	11	Type A	19
4	0	10	All	15
4	1	5	All	15
4	2	6	All	10
4	3	3	All	10
4	4	1	All	10
4	0	20	Type A	35
4	1	15	Type A	35
4	2	8	Type A	21
4	3	11	Type A	21
4	4	2	Type A	21

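The calculation is done separately for each value of "Groupby":

If column A is 2, Percent is summed over the rows where column B is 1 or 2 (the row with B equal to 0 keeps its own Percent).
If column A is 3, Percent is summed over the rows where column B is 0 or 1, and separately over the rows where B is 2 or 3.
If column A is 4, Percent is summed over the rows where column B is 0 or 1, and separately over the rows where B is 2, 3, or 4.

Is there a quick way to do this? Thank you.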
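
For anyone who wants to reproduce this, the sample frame can be rebuilt along these lines. This is only a sketch: the `percents` dict and the loops are a compact way to regenerate the table above and are not part of the original post.

```python
import pandas as pd

# Percent for B = 0..4 under each "Groupby" label, copied from the table above
percents = {
    "All": [10, 5, 6, 3, 1],
    "Type A": [20, 15, 8, 11, 2],
}

rows = []
for a in (2, 3, 4):                          # column A
    for label, values in percents.items():   # column Groupby
        for b in range(a + 1):               # column B runs from 0 to A
            rows.append((a, b, values[b], label))

df = pd.DataFrame(rows, columns=["A", "B", "Percent", "Groupby"])
print(df)
```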

Answer 1

Score: 3

Given your edit, a different strategy works better: create a subgroup for each condition, then sum within those subgroups.

conds = [
#   (2, [0]),
    (2, [1, 2]),
    (3, [0, 1]),
    (3, [2, 3]),
    (4, [0, 1]),
    (4, [2, 3, 4]),
]
# Create subgroups according to your conditions
g = sum([i * (df['A'].eq(a) & df['B'].isin(b))
           for i, (a, b) in enumerate(conds, 1)]).mask(lambda x: x==0)
# Same with np.select
# masks = [(df['A'].eq(a) & df['B'].isin(b)) for a, b in conds]
# choices = 1 + np.arange(len(masks))
# g = np.select(masks, choices, default=np.nan)
# Group by A and Groupby columns and your new subgroups
df['sum'] = (df.groupby(['A', 'Groupby', g])['Percent']
               .transform('sum').fillna(df['Percent']))

Note: to avoid fillna, you can be explicit and include (2, [0]) in the condition list (it is commented out above) so that every (A, B) combination is matched.

The output is now:

>>> df
    A  B  Percent Groupby   sum
0   2  0       10     All  10.0
1   2  1        5     All  11.0
2   2  2        6     All  11.0
3   2  0       20  Type A  20.0
4   2  1       15  Type A  23.0
5   2  2        8  Type A  23.0
6   3  0       10     All  15.0
7   3  1        5     All  15.0
8   3  2        6     All   9.0
9   3  3        3     All   9.0
10  3  0       20  Type A  35.0
11  3  1       15  Type A  35.0
12  3  2        8  Type A  19.0
13  3  3       11  Type A  19.0
14  4  0       10     All  15.0
15  4  1        5     All  15.0
16  4  2        6     All  10.0
17  4  3        3     All  10.0
18  4  4        1     All  10.0
19  4  0       20  Type A  35.0
20  4  1       15  Type A  35.0
21  4  2        8  Type A  21.0
22  4  3       11  Type A  21.0
23  4  4        2  Type A  21.0

And subgroups:

>>> g
0     NaN
1     1.0
2     1.0
3     NaN
4     1.0
5     1.0
6     2.0
7     2.0
8     3.0
9     3.0
10    2.0
11    2.0
12    3.0
13    3.0
14    4.0
15    4.0
16    5.0
17    5.0
18    5.0
19    4.0
20    4.0
21    5.0
22    5.0
23    5.0
dtype: float64
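
Putting the pieces together, here is a minimal self-contained sketch of the subgroup approach with the note applied, so that (2, [0]) is listed explicitly and no fillna is needed. The frame construction and the `subgroup` name are assumptions made for illustration; the conditions are the ones from the answer.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question (B runs from 0 to A in each block)
percents = {"All": [10, 5, 6, 3, 1], "Type A": [20, 15, 8, 11, 2]}
df = pd.DataFrame(
    [(a, b, vals[b], label)
     for a in (2, 3, 4)
     for label, vals in percents.items()
     for b in range(a + 1)],
    columns=["A", "B", "Percent", "Groupby"],
)

# Every (A, B) pair belongs to exactly one condition, including (2, [0])
conds = [
    (2, [0]), (2, [1, 2]),
    (3, [0, 1]), (3, [2, 3]),
    (4, [0, 1]), (4, [2, 3, 4]),
]
masks = [df["A"].eq(a) & df["B"].isin(b) for a, b in conds]

# Label each row with the 1-based position of the condition it matches
labels = np.select(masks, list(range(1, len(masks) + 1)), default=0)
g = pd.Series(labels, index=df.index, name="subgroup")
assert g.ne(0).all()  # no row is left without a subgroup

# Sum Percent within each (A, Groupby, subgroup) and broadcast back to the rows
df["sum"] = df.groupby(["A", "Groupby", g])["Percent"].transform("sum")
print(df)
```

Because every row gets a label, no NaN placeholders are introduced, so the `sum` column stays integer-typed rather than being upcast to float.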

Old answer

You can use np.select to match your conditions, then groupby(...).transform('sum') to broadcast each sum back to the matching rows.

EDIT: As suggested by @jezrael, you can avoid writing out every condition by using a mapping dict and a comprehension:

# conds = [df['A'] == 2,
#          df['A'] == 3,
#          df['A'] == 4]
# choices = [df['Percent'].where(df['B'].isin([1, 2])),
#            df['Percent'].where(df['B'].isin([2, 3])),
#            df['Percent'].where(df['B'].isin([2, 3, 4]))]
# Equivalent to
dmap = {2: [1, 2], 3: [2, 3], 4: [2, 3, 4]}
#                       conds --v           choices --v
conds, choices = zip(*[(df['A'] == k, df['Percent'].where(df['B'].isin(v)))
                       for k, v in dmap.items()])
df['sum'] = np.select(conds, choices)
df['sum'] = (df.mask(df['sum'].isna())
               .groupby(['Groupby', 'A'])['sum']
               .transform('sum').fillna(df['Percent']))

Output:

>>> df
    A  B  Percent Groupby   sum
0   2  0       10     All  10.0
1   2  1        5     All  11.0
2   2  2        6     All  11.0
3   2  0       20  Type A  20.0
4   2  1       15  Type A  23.0
5   2  2        8  Type A  23.0
6   3  0       10     All  10.0
7   3  1        5     All   5.0
8   3  2        6     All   9.0
9   3  3        3     All   9.0
10  3  0       20  Type A  20.0
11  3  1       15  Type A  15.0
12  3  2        8  Type A  19.0
13  3  3       11  Type A  19.0
14  4  0       10     All  10.0
15  4  1        5     All   5.0
16  4  2        6     All  10.0
17  4  3        3     All  10.0
18  4  4        1     All  10.0
19  4  0       20  Type A  20.0
20  4  1       15  Type A  15.0
21  4  2        8  Type A  21.0
22  4  3       11  Type A  21.0
23  4  4        2  Type A  21.0

After np.select, the result is:

>>> df['sum']
0      NaN
1      5.0
2      6.0
3      NaN
4     15.0
5      8.0
6      NaN
7      NaN
8      6.0
9      3.0
10     NaN
11     NaN
12     8.0
13    11.0
14     NaN
15     NaN
16     6.0
17     3.0
18     1.0
19     NaN
20     NaN
21     8.0
22    11.0
23     2.0
Name: sum, dtype: float64
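
To see where the NaN values in that intermediate result come from, here is a small sketch on a toy frame (the `toy` data and its mapping are made up for illustration). `where` blanks out Percent for rows whose B is not in the list, and np.select then picks, row by row, the choice that belongs to the matching A.

```python
import numpy as np
import pandas as pd

# Toy data: two rows for A == 2 and two rows for A == 3
toy = pd.DataFrame({"A": [2, 2, 3, 3],
                    "B": [0, 1, 0, 2],
                    "Percent": [10, 5, 20, 8]})

# For each value of A, the B values that take part in the sum (toy mapping)
dmap = {2: [1], 3: [2]}

# One condition per A value
conds = [toy["A"] == k for k in dmap]
# One choice per A value: Percent where B is listed, NaN elsewhere
choices = [toy["Percent"].where(toy["B"].isin(v)) for v in dmap.values()]

# Row by row, take the choice of the first condition that is True
picked = np.select(conds, choices, default=np.nan)
print(picked)
```

Rows 0 and 2 come out as NaN because their B value is not listed for their A value; rows 1 and 3 keep their Percent (5 and 8).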

Answer 2

Score: 2

To avoid spelling out many conditions, create a dictionary that maps each A value to the B values to be summed, convert it to a DataFrame, and use DataFrame.merge with a right join and indicator=True. Rows that match the mapping are then summed per group with GroupBy.transform, and numpy.where keeps the original Percent for the rows that do not match:

d = {2:[1,2], 3:[2,3], 4:[2,3,4]}
df = pd.DataFrame([(k, x) for k, v in d.items() for x in v],
                  columns=['A','B']).merge(df, how='right', indicator=True)
m = df.pop('_merge').eq('both')
df['sum'] = np.where(m,
                     df.groupby(['Groupby', 'A', m])['Percent'].transform('sum'),
                     df['Percent'])

print(df)
    A  B  Percent Groupby  sum
0   2  0       10     All   10
1   2  1        5     All   11
2   2  2        6     All   11
3   2  0       20  Type A   20
4   2  1       15  Type A   23
5   2  2        8  Type A   23
6   3  0       10     All   10
7   3  1        5     All    5
8   3  2        6     All    9
9   3  3        3     All    9
10  3  0       20  Type A   20
11  3  1       15  Type A   15
12  3  2        8  Type A   19
13  3  3       11  Type A   19
14  4  0       10     All   10
15  4  1        5     All    5
16  4  2        6     All   10
17  4  3        3     All   10
18  4  4        1     All   10
19  4  0       20  Type A   20
20  4  1       15  Type A   15
21  4  2        8  Type A   21
22  4  3       11  Type A   21
23  4  4        2  Type A   21
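
The output above leaves the rows with B equal to 0 and 1 unsummed for A equal to 3 and 4, unlike the result requested in the question. One possible way to adapt the merge idea is to merge in a lookup table that assigns every (A, B) pair to a part, and then group on that part. This is a sketch under assumptions: the `parts` dict, the `part` column, and the `out` name are hypothetical and not from the original answer.

```python
import pandas as pd

# Rebuild the sample frame from the question (B runs from 0 to A in each block)
percents = {"All": [10, 5, 6, 3, 1], "Type A": [20, 15, 8, 11, 2]}
df = pd.DataFrame(
    [(a, b, vals[b], label)
     for a in (2, 3, 4)
     for label, vals in percents.items()
     for b in range(a + 1)],
    columns=["A", "B", "Percent", "Groupby"],
)

# Hypothetical lookup: for each A, the lists of B values that are summed together
parts = {2: [[0], [1, 2]], 3: [[0, 1], [2, 3]], 4: [[0, 1], [2, 3, 4]]}
lookup = pd.DataFrame(
    [(a, b, part)
     for a, groups in parts.items()
     for part, bs in enumerate(groups)
     for b in bs],
    columns=["A", "B", "part"],
)

# A left merge keeps the row order of df; validate rejects duplicate (A, B) keys in lookup
out = df.merge(lookup, on=["A", "B"], how="left", validate="many_to_one")
assert out["part"].notna().all()  # every row must be covered by the lookup

# Sum Percent within each (Groupby, A, part) and broadcast back to the rows
out["sum"] = out.groupby(["Groupby", "A", "part"])["Percent"].transform("sum")
out = out.drop(columns="part")
print(out)
```

Listing every part explicitly removes the need for the indicator column and the numpy.where fallback, at the cost of a slightly longer mapping.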
