合并符合条件的行并按分组求和

huangapple go评论81阅读模式
英文:

sum rows with condition and groupby

问题

我有一个以下的数据框:

A B Percent Groupby
2 0 10 All
2 1 5 All
2 2 6 All
2 0 20 Type A
2 1 15 Type A
2 2 8 Type A
3 0 10 All
3 1 5 All
3 2 6 All
3 3 3 All
3 0 20 Type A
3 1 15 Type A
3 2 8 Type A
3 3 11 Type A
4 0 10 All
4 1 5 All
4 2 6 All
4 3 3 All
4 4 1 All
4 0 20 Type A
4 1 15 Type A
4 2 8 Type A
4 3 11 Type A
4 4 2 Type A

我想要得到这个结果:

A B Percent Groupby sum
2 0 10 All 10
2 1 5 All 11
2 2 6 All 11
2 0 20 Type A 20
2 1 15 Type A 23
2 2 8 Type A 23
3 0 10 All 15
3 1 5 All 15
3 2 6 All 9
3 3 3 All 9
3 0 20 Type A 35
3 1 15 Type A 35
3 2 8 Type A 19
3 3 11 Type A 19
4 0 10 All 15
4 1 5 All 15
4 2 6 All 10
4 3 3 All 10
4 4 1 All 10
4 0 20 Type A 35
4 1 15 Type A 35
4 2 8 Type A 21
4 3 11 Type A 21
4 4 2 Type A 21

计算是按照 "Groupby" 进行的:

  • 如果列 A 为 2,则将百分比相加,其中列 B 的值为 1 和 2。
  • 如果列 A 为 3,则将百分比相加,其中列 B 的值为 0 和 1,以及 2 和 3。
  • 如果列 A 为 4,则将百分比相加,其中列 B 的值为 0 和 1,以及 2、3 和 4。

有没有快速的方法来做到这一点?谢谢。

英文:

I have a dataframe below:

A B Percent Groupby
2 0 10 All
2 1 5 All
2 2 6 All
2 0 20 Type A
2 1 15 Type A
2 2 8 Type A
3 0 10 All
3 1 5 All
3 2 6 All
3 3 3 All
3 0 20 Type A
3 1 15 Type A
3 2 8 Type A
3 3 11 Type A
4 0 10 All
4 1 5 All
4 2 6 All
4 3 3 All
4 4 1 All
4 0 20 Type A
4 1 15 Type A
4 2 8 Type A
4 3 11 Type A
4 4 2 Type A

I would like to get this result:

A B Percent Groupby sum
2 0 10 All 10
2 1 5 All 11
2 2 6 All 11
2 0 20 Type A 20
2 1 15 Type A 23
2 2 8 Type A 23
3 0 10 All 15
3 1 5 All 15
3 2 6 All 9
3 3 3 All 9
3 0 20 Type A 35
3 1 15 Type A 35
3 2 8 Type A 19
3 3 11 Type A 19
4 0 10 All 15
4 1 5 All 15
4 2 6 All 10
4 3 3 All 10
4 4 1 All 10
4 0 20 Type A 35
4 1 15 Type A 35
4 2 8 Type A 21
4 3 11 Type A 21
4 4 2 Type A 21

The calculation is groupby "Groupby"

  • if Col A is 2 then percent is summed for Col B with values of 1 and 2.
  • if Col A is 3 then percent is summed for Col B with values of 0 and 1, and 2 and 3.
  • if Col A is 4 then percent is summed for Col B with values of 0 and 1, and 2, 3, and 4.

Is there a quick way to do this? Thank you.

答案1

得分: 3

With your edit, it can be interesting to change the strategy. What you should do is to create subgroups for each condition:

  1. conds = [
  2. # (2, [0]),
  3. (2, [1, 2]),
  4. (3, [0, 1]),
  5. (3, [2, 3]),
  6. (4, [0, 1]),
  7. (4, [2, 3, 4]),
  8. ]
  9. # Create subgroups according to your conditions
  10. g = sum([i * (df['A'].eq(a) & df['B'].isin(b))
  11. for i, (a, b) in enumerate(conds, 1)]).mask(lambda x: x==0)
  12. # Same with np.select
  13. # masks = [(df['A'].eq(a) & df['B'].isin(b)) for a, b in conds]
  14. # choices = 1 + np.arange(len(masks))
  15. # g = np.select(masks, choices, default=np.nan)
  16. # Group by A and Groupby columns and your new subgroups
  17. df['sum'] = (df.groupby(['A', 'Groupby', g])['Percent']
  18. .transform('sum').fillna(df['Percent']))

Note: in fact, to avoid fillna, you can be explicit and append (2, [0]) to the condition list to match every combination.

英文:

With your edit, it can be interesting to change the strategy. What you should do is to create subgroups for each condition:

  1. conds = [
  2. # (2, [0]),
  3. (2, [1, 2]),
  4. (3, [0, 1]),
  5. (3, [2, 3]),
  6. (4, [0, 1]),
  7. (4, [2, 3, 4]),
  8. ]
  9. # Create subgroups according your conditions
  10. g = sum([i * (df['A'].eq(a) & df['B'].isin(b))
  11. for i, (a, b) in enumerate(conds, 1)]).mask(lambda x: x==0)
  12. # Same with np.select
  13. # masks = [(df['A'].eq(a) & df['B'].isin(b)) for a, b in conds]
  14. # choices = 1 + np.arange(len(masks))
  15. # g = np.select(masks, choices, default=np.nan)
  16. # Group by A and Groupby columns and your new subgroups
  17. df['sum'] = (df.groupby(['A', 'Groupby', g])['Percent']
  18. .transform('sum').fillna(df['Percent']))

Note: in fact, to avoid fillna, you can be explicit and append (2, [0]) to the condition list to match every combinations.

The output is now:

  1. >>> df
  2. A B Percent Groupby sum
  3. 0 2 0 10 All 10.0
  4. 1 2 1 5 All 11.0
  5. 2 2 2 6 All 11.0
  6. 3 2 0 20 Type A 20.0
  7. 4 2 1 15 Type A 23.0
  8. 5 2 2 8 Type A 23.0
  9. 6 3 0 10 All 15.0
  10. 7 3 1 5 All 15.0
  11. 8 3 2 6 All 9.0
  12. 9 3 3 3 All 9.0
  13. 10 3 0 20 Type A 35.0
  14. 11 3 1 15 Type A 35.0
  15. 12 3 2 8 Type A 19.0
  16. 13 3 3 11 Type A 19.0
  17. 14 4 0 10 All 15.0
  18. 15 4 1 5 All 15.0
  19. 16 4 2 6 All 10.0
  20. 17 4 3 3 All 10.0
  21. 18 4 4 1 All 10.0
  22. 19 4 0 20 Type A 35.0
  23. 20 4 1 15 Type A 35.0
  24. 21 4 2 8 Type A 21.0
  25. 22 4 3 11 Type A 21.0
  26. 23 4 4 2 Type A 21.0

And subgroups:

  1. >>> g
  2. 0 NaN
  3. 1 1.0
  4. 2 1.0
  5. 3 NaN
  6. 4 1.0
  7. 5 1.0
  8. 6 2.0
  9. 7 2.0
  10. 8 3.0
  11. 9 3.0
  12. 10 2.0
  13. 11 2.0
  14. 12 3.0
  15. 13 3.0
  16. 14 4.0
  17. 15 4.0
  18. 16 5.0
  19. 17 5.0
  20. 18 5.0
  21. 19 4.0
  22. 20 4.0
  23. 21 5.0
  24. 22 5.0
  25. 23 5.0
  26. dtype: float64

Old answer

EDIT: As suggested by @jezrael, you can avoid to write all conditions by using a mapping dict and a comprehension:

You can use np.select to match your conditions then use groupby_transform to broadcast the sum on right rows:

  1. # conds = [df['A'] == 2,
  2. # df['A'] == 3,
  3. # df['A'] == 4]
  4. # choices = [df['Percent'].where(df['B'].isin([1, 2])),
  5. # df['Percent'].where(df['B'].isin([2, 3])),
  6. # df['Percent'].where(df['B'].isin([2, 3, 4]))]
  7. # Equivalent to
  8. dmap = {2: [1, 2], 3: [2, 3], 4: [2, 3, 4]}
  9. # conds --v choices --v
  10. conds, choices = zip(*[(df['A'] == k, df['Percent'].where(df['B'].isin(v)))
  11. for k, v in dmap.items()])
  12. df['sum'] = np.select(conds, choices)
  13. df['sum'] = (df.mask(df['sum'].isna())
  14. .groupby(['Groupby', 'A'])['sum']
  15. .transform('sum').fillna(df['Percent']))

Output:

  1. >>> df
  2. A B Percent Groupby sum
  3. 0 2 0 10 All 10.0
  4. 1 2 1 5 All 11.0
  5. 2 2 2 6 All 11.0
  6. 3 2 0 20 Type A 20.0
  7. 4 2 1 15 Type A 23.0
  8. 5 2 2 8 Type A 23.0
  9. 6 3 0 10 All 10.0
  10. 7 3 1 5 All 5.0
  11. 8 3 2 6 All 9.0
  12. 9 3 3 3 All 9.0
  13. 10 3 0 20 Type A 20.0
  14. 11 3 1 15 Type A 15.0
  15. 12 3 2 8 Type A 19.0
  16. 13 3 3 11 Type A 19.0
  17. 14 4 0 10 All 10.0
  18. 15 4 1 5 All 5.0
  19. 16 4 2 6 All 10.0
  20. 17 4 3 3 All 10.0
  21. 18 4 4 1 All 10.0
  22. 19 4 0 20 Type A 20.0
  23. 20 4 1 15 Type A 15.0
  24. 21 4 2 8 Type A 21.0
  25. 22 4 3 11 Type A 21.0
  26. 23 4 4 2 Type A 21.0

After np.select, the result is:

  1. >>> df['sum']
  2. 0 NaN
  3. 1 5.0
  4. 2 6.0
  5. 3 NaN
  6. 4 15.0
  7. 5 8.0
  8. 6 NaN
  9. 7 NaN
  10. 8 6.0
  11. 9 3.0
  12. 10 NaN
  13. 11 NaN
  14. 12 8.0
  15. 13 11.0
  16. 14 NaN
  17. 15 NaN
  18. 16 6.0
  19. 17 3.0
  20. 18 1.0
  21. 19 NaN
  22. 20 NaN
  23. 21 8.0
  24. 22 11.0
  25. 23 2.0
  26. Name: sum, dtype: float64

答案2

得分: 2

为避免指定许多条件,创建将AB组进行映射的字典,转换为DataFrame并使用DataFrame.merge进行right连接,以便在GroupBy.transform中为每个组可能创建新列的sum,仅当通过numpy.where匹配条件时:

  1. d = {2:[1,2], 3:[2,3], 4:[2,3,4]}
  2. df = pd.DataFrame([(k, x) for k, v in d.items() for x in v],
  3. columns=['A','B']).merge(df, how='right', indicator=True)
  4. m = df.pop('_merge').eq('both')
  5. df['sum'] = np.where(m,
  6. df.groupby(['Groupby', 'A', m])['Percent'].transform('sum'),
  7. df['Percent'])

  1. print (df)
  2. A B Percent Groupby sum
  3. 0 2 0 10 All 10
  4. 1 2 1 5 All 11
  5. 2 2 2 6 All 11
  6. 3 2 0 20 Type A 20
  7. 4 2 1 15 Type A 23
  8. 5 2 2 8 Type A 23
  9. 6 3 0 10 All 10
  10. 7 3 1 5 All 5
  11. 8 3 2 6 All 9
  12. 9 3 3 3 All 9
  13. 10 3 0 20 Type A 20
  14. 11 3 1 15 Type A 15
  15. 12 3 2 8 Type A 19
  16. 13 3 3 11 Type A 19
  17. 14 4 0 10 All 10
  18. 15 4 1 5 All 5
  19. 16 4 2 6 All 10
  20. 17 4 3 3 All 10
  21. 18 4 4 1 All 10
  22. 19 4 0 20 Type A 20
  23. 20 4 1 15 Type A 15
  24. 21 4 2 8 Type A 21
  25. 22 4 3 11 Type A 21
  26. 23 4 4 2 Type A 21
英文:

For avoid specify many conditions create dictionary for mapping A and B groups, convert to DataFrame and use DataFrame.merge with right join, so possible create sums per groups to new column in GroupBy.transform only if match condition by numpy.where:

  1. d = {2:[1,2], 3:[2,3], 4:[2,3,4]}
  2. df = pd.DataFrame([(k, x) for k, v in d.items() for x in v],
  3. columns=['A','B']).merge(df, how='right', indicator=True)
  4. m = df.pop('_merge').eq('both')
  5. df['sum'] = np.where(m,
  6. df.groupby(['Groupby', 'A', m])['Percent'].transform('sum'),
  7. df['Percent'])

  1. print (df)
  2. A B Percent Groupby sum
  3. 0 2 0 10 All 10
  4. 1 2 1 5 All 11
  5. 2 2 2 6 All 11
  6. 3 2 0 20 Type A 20
  7. 4 2 1 15 Type A 23
  8. 5 2 2 8 Type A 23
  9. 6 3 0 10 All 10
  10. 7 3 1 5 All 5
  11. 8 3 2 6 All 9
  12. 9 3 3 3 All 9
  13. 10 3 0 20 Type A 20
  14. 11 3 1 15 Type A 15
  15. 12 3 2 8 Type A 19
  16. 13 3 3 11 Type A 19
  17. 14 4 0 10 All 10
  18. 15 4 1 5 All 5
  19. 16 4 2 6 All 10
  20. 17 4 3 3 All 10
  21. 18 4 4 1 All 10
  22. 19 4 0 20 Type A 20
  23. 20 4 1 15 Type A 15
  24. 21 4 2 8 Type A 21
  25. 22 4 3 11 Type A 21
  26. 23 4 4 2 Type A 21

huangapple
  • 本文由 发表于 2023年6月29日 11:24:10
  • 转载请务必保留本文链接:https://go.coder-hub.com/76577883.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定