sum rows with condition and groupby

Question

I have the following dataframe:

A  B  Percent  Groupby
2  0       10  All
2  1        5  All
2  2        6  All
2  0       20  Type A
2  1       15  Type A
2  2        8  Type A
3  0       10  All
3  1        5  All
3  2        6  All
3  3        3  All
3  0       20  Type A
3  1       15  Type A
3  2        8  Type A
3  3       11  Type A
4  0       10  All
4  1        5  All
4  2        6  All
4  3        3  All
4  4        1  All
4  0       20  Type A
4  1       15  Type A
4  2        8  Type A
4  3       11  Type A
4  4        2  Type A

I would like to get this result:

A  B  Percent  Groupby  sum
2  0       10  All       10
2  1        5  All       11
2  2        6  All       11
2  0       20  Type A    20
2  1       15  Type A    23
2  2        8  Type A    23
3  0       10  All       15
3  1        5  All       15
3  2        6  All        9
3  3        3  All        9
3  0       20  Type A    35
3  1       15  Type A    35
3  2        8  Type A    19
3  3       11  Type A    19
4  0       10  All       15
4  1        5  All       15
4  2        6  All       10
4  3        3  All       10
4  4        1  All       10
4  0       20  Type A    35
4  1       15  Type A    35
4  2        8  Type A    21
4  3       11  Type A    21
4  4        2  Type A    21

The calculation is done separately for each value of the "Groupby" column:

  • If column A is 2, Percent is summed over the rows where column B is 1 or 2.
  • If column A is 3, Percent is summed over the rows where column B is 0 or 1, and separately over the rows where column B is 2 or 3.
  • If column A is 4, Percent is summed over the rows where column B is 0 or 1, and separately over the rows where column B is 2, 3, or 4.

Is there a quick way to do this? Thank you.
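To make the answers below easier to try out, here is one possible way to rebuild the sample frame from the question. It is only a sketch: the helper names (all_pct, type_a_pct, rows) are made up for illustration, and the frame is assumed to be called df, as in the answers.

```python
import pandas as pd

# Percent values by B index, copied from the question's table.
all_pct = [10, 5, 6, 3, 1]
type_a_pct = [20, 15, 8, 11, 2]

rows = []
for a in (2, 3, 4):
    for label, pcts in (("All", all_pct), ("Type A", type_a_pct)):
        # For a given A, column B runs from 0 to A inclusive.
        for b in range(a + 1):
            rows.append({"A": a, "B": b, "Percent": pcts[b], "Groupby": label})

df = pd.DataFrame(rows)
print(df)
```

The loop order reproduces the row order of the table above (all "All" rows for a given A, then all "Type A" rows).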

Answer 1

Score: 3

Given your edit, it is worth changing strategy: create a subgroup for each condition, then group on those subgroups:

conds = [
#   (2, [0]),
    (2, [1, 2]),
    (3, [0, 1]),
    (3, [2, 3]),
    (4, [0, 1]),
    (4, [2, 3, 4]),
]

# Create subgroups according to your conditions
g = sum([i * (df['A'].eq(a) & df['B'].isin(b))
           for i, (a, b) in enumerate(conds, 1)]).mask(lambda x: x==0)

# Same with np.select (requires `import numpy as np`)
# masks = [(df['A'].eq(a) & df['B'].isin(b)) for a, b in conds]
# choices = 1 + np.arange(len(masks))
# g = np.select(masks, choices, default=np.nan)

# Group by A and Groupby columns and your new subgroups
df['sum'] = (df.groupby(['A', 'Groupby', g])['Percent']
               .transform('sum').fillna(df['Percent']))

Note: to avoid fillna, you can be explicit and include (2, [0]) in the condition list (it is commented out above) so that every combination is matched.
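The note above can be sketched as follows. This is a minimal, self-contained variant rather than the answer's exact code: it uses a reduced sample (only the "All" rows for A = 2 and A = 3), lists every (A, B) combination explicitly, and labels subgroups with np.select so that no NaN labels and no fillna are needed. The series name "subgroup" is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Reduced sample: the "All" rows for A = 2 and A = 3 from the question.
df = pd.DataFrame({
    "A": [2, 2, 2, 3, 3, 3, 3],
    "B": [0, 1, 2, 0, 1, 2, 3],
    "Percent": [10, 5, 6, 10, 5, 6, 3],
    "Groupby": ["All"] * 7,
})

# Every (A, B) combination is covered, including the singleton (2, [0]).
conds = [
    (2, [0]),
    (2, [1, 2]),
    (3, [0, 1]),
    (3, [2, 3]),
]

# One boolean mask per condition; subgroup labels are 1..len(conds).
masks = [(df["A"].eq(a) & df["B"].isin(b)).to_numpy() for a, b in conds]
labels = np.select(masks, np.arange(1, len(masks) + 1), default=0)
g = pd.Series(labels, index=df.index, name="subgroup")

# Because every row has a label, transform('sum') leaves no gaps to fill.
df["sum"] = df.groupby(["A", "Groupby", g])["Percent"].transform("sum")
print(df)
```

Since no row is left unlabeled, the sum column stays an integer column instead of being upcast to float.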

The output is now:

>>> df
    A  B  Percent Groupby   sum
0   2  0       10     All  10.0
1   2  1        5     All  11.0
2   2  2        6     All  11.0
3   2  0       20  Type A  20.0
4   2  1       15  Type A  23.0
5   2  2        8  Type A  23.0
6   3  0       10     All  15.0
7   3  1        5     All  15.0
8   3  2        6     All   9.0
9   3  3        3     All   9.0
10  3  0       20  Type A  35.0
11  3  1       15  Type A  35.0
12  3  2        8  Type A  19.0
13  3  3       11  Type A  19.0
14  4  0       10     All  15.0
15  4  1        5     All  15.0
16  4  2        6     All  10.0
17  4  3        3     All  10.0
18  4  4        1     All  10.0
19  4  0       20  Type A  35.0
20  4  1       15  Type A  35.0
21  4  2        8  Type A  21.0
22  4  3       11  Type A  21.0
23  4  4        2  Type A  21.0

And subgroups:

>>> g
0     NaN
1     1.0
2     1.0
3     NaN
4     1.0
5     1.0
6     2.0
7     2.0
8     3.0
9     3.0
10    2.0
11    2.0
12    3.0
13    3.0
14    4.0
15    4.0
16    5.0
17    5.0
18    5.0
19    4.0
20    4.0
21    5.0
22    5.0
23    5.0
dtype: float64

Old answer

You can use np.select to match your conditions, then use groupby(...).transform('sum') to broadcast the sum to the matching rows.

EDIT: As suggested by @jezrael, you can avoid writing out every condition by using a mapping dict and a comprehension:

# conds = [df['A'] == 2,
#          df['A'] == 3,
#          df['A'] == 4]

# choices = [df['Percent'].where(df['B'].isin([1, 2])),
#            df['Percent'].where(df['B'].isin([2, 3])),
#            df['Percent'].where(df['B'].isin([2, 3, 4]))]

# Equivalent to
dmap = {2: [1, 2], 3: [2, 3], 4: [2, 3, 4]}

#                       conds --v           choices --v
conds, choices = zip(*[(df['A'] == k, df['Percent'].where(df['B'].isin(v)))
                       for k, v in dmap.items()])


df['sum'] = np.select(conds, choices)

df['sum'] = (df.mask(df['sum'].isna())
               .groupby(['Groupby', 'A'])['sum']
               .transform('sum').fillna(df['Percent']))

Output:

>>> df
    A  B  Percent Groupby   sum
0   2  0       10     All  10.0
1   2  1        5     All  11.0
2   2  2        6     All  11.0
3   2  0       20  Type A  20.0
4   2  1       15  Type A  23.0
5   2  2        8  Type A  23.0
6   3  0       10     All  10.0
7   3  1        5     All   5.0
8   3  2        6     All   9.0
9   3  3        3     All   9.0
10  3  0       20  Type A  20.0
11  3  1       15  Type A  15.0
12  3  2        8  Type A  19.0
13  3  3       11  Type A  19.0
14  4  0       10     All  10.0
15  4  1        5     All   5.0
16  4  2        6     All  10.0
17  4  3        3     All  10.0
18  4  4        1     All  10.0
19  4  0       20  Type A  20.0
20  4  1       15  Type A  15.0
21  4  2        8  Type A  21.0
22  4  3       11  Type A  21.0
23  4  4        2  Type A  21.0

After np.select, the result is:

>>> df['sum']
0      NaN
1      5.0
2      6.0
3      NaN
4     15.0
5      8.0
6      NaN
7      NaN
8      6.0
9      3.0
10     NaN
11     NaN
12     8.0
13    11.0
14     NaN
15     NaN
16     6.0
17     3.0
18     1.0
19     NaN
20     NaN
21     8.0
22    11.0
23     2.0
Name: sum, dtype: float64
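For readers unfamiliar with np.select, here is a small standalone illustration of how it picks values row by row. The arrays are made-up toy data, not the question's frame.

```python
import numpy as np

a = np.array([2, 3, 4, 2])
pct = np.array([10.0, 20.0, 30.0, 40.0])

# For each position, np.select takes the value from the first choice
# whose condition is True; where no condition holds, it uses `default`.
conds = [a == 2, a == 3]
choices = [pct, pct * 2]

out = np.select(conds, choices, default=np.nan)
print(out)
```

In the old answer above, the choices are themselves masked with where, which is why rows whose B value is not in the mapping come out as NaN even though one of the conditions on A matches.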

Answer 2

Score: 2

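To avoid specifying many conditions, create a dictionary that maps each A value to its B values, convert it to a DataFrame, and use DataFrame.merge with a right join and indicator=True. The indicator marks the rows that match a condition, so numpy.where can assign the per-group sums from GroupBy.transform to the new column for those rows only and keep Percent everywhere else: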

d = {2:[1,2], 3:[2,3], 4:[2,3,4]}


df = pd.DataFrame([(k, x) for k, v in d.items() for x in v],
                  columns=['A','B']).merge(df, how='right', indicator=True)

m = df.pop('_merge').eq('both')
df['sum'] = np.where(m, 
                     df.groupby(['Groupby', 'A', m])['Percent'].transform('sum'),
                     df['Percent'])

print(df)
    A  B  Percent Groupby  sum
0   2  0       10     All   10
1   2  1        5     All   11
2   2  2        6     All   11
3   2  0       20  Type A   20
4   2  1       15  Type A   23
5   2  2        8  Type A   23
6   3  0       10     All   10
7   3  1        5     All    5
8   3  2        6     All    9
9   3  3        3     All    9
10  3  0       20  Type A   20
11  3  1       15  Type A   15
12  3  2        8  Type A   19
13  3  3       11  Type A   19
14  4  0       10     All   10
15  4  1        5     All    5
16  4  2        6     All   10
17  4  3        3     All   10
18  4  4        1     All   10
19  4  0       20  Type A   20
20  4  1       15  Type A   15
21  4  2        8  Type A   21
22  4  3       11  Type A   21
23  4  4        2  Type A   21
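The output above sums only one set of B values per A, so rows such as A = 3, B = 0 keep their own Percent. One possible way to adapt the merge idea to several subgroups per A is sketched below. This is not part of the answer: the subgroups dict, the mapping frame, and the "subgroup" column are hypothetical names, and the sample is reduced to the A = 3 rows.

```python
import pandas as pd

# Reduced sample: the A = 3 rows from the question.
df = pd.DataFrame({
    "A": [3] * 8,
    "B": [0, 1, 2, 3, 0, 1, 2, 3],
    "Percent": [10, 5, 6, 3, 20, 15, 8, 11],
    "Groupby": ["All"] * 4 + ["Type A"] * 4,
})

# Hypothetical mapping: for each A, a list of B subgroups to sum together.
subgroups = {
    2: [[0], [1, 2]],
    3: [[0, 1], [2, 3]],
    4: [[0, 1], [2, 3, 4]],
}

# Flatten to one row per (A, B) pair with a subgroup id.
mapping = pd.DataFrame(
    [(a, b, i)
     for a, groups in subgroups.items()
     for i, bs in enumerate(groups)
     for b in bs],
    columns=["A", "B", "subgroup"],
)

# A left merge keeps the original row order of df.
out = df.merge(mapping, on=["A", "B"], how="left")
out["sum"] = (out.groupby(["Groupby", "A", "subgroup"])["Percent"]
                 .transform("sum"))
print(out)
```

Because every (A, B) pair appears in the mapping, no indicator column or numpy.where step is needed in this variant.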

huangapple
  • Posted on June 29, 2023, 11:24:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76577883.html