Sum rows with a condition and groupby
Question
I have the following DataFrame:
A | B | Percent | Groupby |
---|---|---|---|
2 | 0 | 10 | All |
2 | 1 | 5 | All |
2 | 2 | 6 | All |
2 | 0 | 20 | Type A |
2 | 1 | 15 | Type A |
2 | 2 | 8 | Type A |
3 | 0 | 10 | All |
3 | 1 | 5 | All |
3 | 2 | 6 | All |
3 | 3 | 3 | All |
3 | 0 | 20 | Type A |
3 | 1 | 15 | Type A |
3 | 2 | 8 | Type A |
3 | 3 | 11 | Type A |
4 | 0 | 10 | All |
4 | 1 | 5 | All |
4 | 2 | 6 | All |
4 | 3 | 3 | All |
4 | 4 | 1 | All |
4 | 0 | 20 | Type A |
4 | 1 | 15 | Type A |
4 | 2 | 8 | Type A |
4 | 3 | 11 | Type A |
4 | 4 | 2 | Type A |
I would like to get this result:
A | B | Percent | Groupby | sum |
---|---|---|---|---|
2 | 0 | 10 | All | 10 |
2 | 1 | 5 | All | 11 |
2 | 2 | 6 | All | 11 |
2 | 0 | 20 | Type A | 20 |
2 | 1 | 15 | Type A | 23 |
2 | 2 | 8 | Type A | 23 |
3 | 0 | 10 | All | 15 |
3 | 1 | 5 | All | 15 |
3 | 2 | 6 | All | 9 |
3 | 3 | 3 | All | 9 |
3 | 0 | 20 | Type A | 35 |
3 | 1 | 15 | Type A | 35 |
3 | 2 | 8 | Type A | 19 |
3 | 3 | 11 | Type A | 19 |
4 | 0 | 10 | All | 15 |
4 | 1 | 5 | All | 15 |
4 | 2 | 6 | All | 10 |
4 | 3 | 3 | All | 10 |
4 | 4 | 1 | All | 10 |
4 | 0 | 20 | Type A | 35 |
4 | 1 | 15 | Type A | 35 |
4 | 2 | 8 | Type A | 21 |
4 | 3 | 11 | Type A | 21 |
4 | 4 | 2 | Type A | 21 |
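The sums are computed separately within each value of the "Groupby" column:
- if column A is 2, Percent is summed over the rows where column B is 1 or 2;
- if column A is 3, Percent is summed over the rows where column B is 0 or 1, and separately over the rows where column B is 2 or 3;
- if column A is 4, Percent is summed over the rows where column B is 0 or 1, and separately over the rows where column B is 2, 3, or 4.
Rows covered by no rule (A = 2, B = 0) keep their own Percent, as the expected table shows.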
Is there a quick way to do this? Thank you.
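For anyone who wants to run the answers below, the sample frame can be rebuilt with a short sketch like this one. The helper name `percent` is only illustrative; the values and row order are copied from the table in the question.

```python
import pandas as pd

# Percent values per "Groupby" label, indexed by B (taken from the question's table).
percent = {
    "All":    [10, 5, 6, 3, 1],
    "Type A": [20, 15, 8, 11, 2],
}

# For each A, B runs from 0 to A, once for "All" and once for "Type A".
rows = [
    (a, b, percent[grp][b], grp)
    for a in (2, 3, 4)
    for grp in ("All", "Type A")
    for b in range(a + 1)
]
df = pd.DataFrame(rows, columns=["A", "B", "Percent", "Groupby"])
print(df.shape)
```

The frame has the same 24 rows, in the same order, as the table above.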
Answer 1
Score: 3
Since your edit changes the requirement, a different strategy works better: create one subgroup per condition.

conds = [
    # (2, [0]),
    (2, [1, 2]),
    (3, [0, 1]),
    (3, [2, 3]),
    (4, [0, 1]),
    (4, [2, 3, 4]),
]

# Create subgroups according to your conditions:
# rows matching the i-th condition get label i, unmatched rows get NaN
g = sum([i * (df['A'].eq(a) & df['B'].isin(b))
         for i, (a, b) in enumerate(conds, 1)]).mask(lambda x: x == 0)

# Same with np.select
# masks = [(df['A'].eq(a) & df['B'].isin(b)) for a, b in conds]
# choices = 1 + np.arange(len(masks))
# g = np.select(masks, choices, default=np.nan)

# Group by the A and Groupby columns and the new subgroups;
# unmatched rows (NaN label) fall back to their own Percent
df['sum'] = (df.groupby(['A', 'Groupby', g])['Percent']
               .transform('sum').fillna(df['Percent']))

Note: to avoid fillna, you can be explicit and add (2, [0]) to the condition list so that every combination is matched.
The output is now:
>>> df
A B Percent Groupby sum
0 2 0 10 All 10.0
1 2 1 5 All 11.0
2 2 2 6 All 11.0
3 2 0 20 Type A 20.0
4 2 1 15 Type A 23.0
5 2 2 8 Type A 23.0
6 3 0 10 All 15.0
7 3 1 5 All 15.0
8 3 2 6 All 9.0
9 3 3 3 All 9.0
10 3 0 20 Type A 35.0
11 3 1 15 Type A 35.0
12 3 2 8 Type A 19.0
13 3 3 11 Type A 19.0
14 4 0 10 All 15.0
15 4 1 5 All 15.0
16 4 2 6 All 10.0
17 4 3 3 All 10.0
18 4 4 1 All 10.0
19 4 0 20 Type A 35.0
20 4 1 15 Type A 35.0
21 4 2 8 Type A 21.0
22 4 3 11 Type A 21.0
23 4 4 2 Type A 21.0
And the subgroup labels (NaN marks rows that match no condition):
>>> g
0 NaN
1 1.0
2 1.0
3 NaN
4 1.0
5 1.0
6 2.0
7 2.0
8 3.0
9 3.0
10 2.0
11 2.0
12 3.0
13 3.0
14 4.0
15 4.0
16 5.0
17 5.0
18 5.0
19 4.0
20 4.0
21 5.0
22 5.0
23 5.0
dtype: float64
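Following the note about adding (2, [0]) to the condition list, here is a minimal self-contained sketch of the explicit variant, which needs no fillna. It assumes the sample data from the question. The `subgroup` column and the coverage check are illustrative additions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (same values, same row order).
percent = {"All": [10, 5, 6, 3, 1], "Type A": [20, 15, 8, 11, 2]}
df = pd.DataFrame(
    [(a, b, percent[grp][b], grp)
     for a in (2, 3, 4) for grp in ("All", "Type A") for b in range(a + 1)],
    columns=["A", "B", "Percent", "Groupby"],
)

# Every (A, B) combination is listed, including (2, [0]),
# so each row receives a subgroup label and no NaN appears.
conds = [
    (2, [0]),
    (2, [1, 2]),
    (3, [0, 1]),
    (3, [2, 3]),
    (4, [0, 1]),
    (4, [2, 3, 4]),
]
masks = [(df["A"].eq(a) & df["B"].isin(b)).to_numpy() for a, b in conds]
labels = list(range(1, len(masks) + 1))

# "subgroup" is a hypothetical helper column; 0 means "matched no condition".
df["subgroup"] = np.select(masks, labels, default=0)

# Added guard: fail loudly if the condition list misses some row.
assert (df["subgroup"] != 0).all(), "some (A, B) pair matches no condition"

df["sum"] = df.groupby(["A", "Groupby", "subgroup"])["Percent"].transform("sum")
print(df.drop(columns="subgroup"))
```

Because every row has a label, the grouping keys contain no missing values and the `sum` column stays an integer column rather than being upcast to float.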
Old answer
EDIT: As suggested by @jezrael, you can avoid writing out every condition by using a mapping dict and a comprehension.
You can use np.select to match your conditions, then GroupBy.transform to broadcast the sum to the matching rows:
# conds = [df['A'] == 2,
#          df['A'] == 3,
#          df['A'] == 4]
# choices = [df['Percent'].where(df['B'].isin([1, 2])),
#            df['Percent'].where(df['B'].isin([2, 3])),
#            df['Percent'].where(df['B'].isin([2, 3, 4]))]

# Equivalent to
dmap = {2: [1, 2], 3: [2, 3], 4: [2, 3, 4]}
#  conds --v    choices --v
conds, choices = zip(*[(df['A'] == k, df['Percent'].where(df['B'].isin(v)))
                       for k, v in dmap.items()])
df['sum'] = np.select(conds, choices)
df['sum'] = (df.mask(df['sum'].isna())
               .groupby(['Groupby', 'A'])['sum']
               .transform('sum').fillna(df['Percent']))
Output:
>>> df
A B Percent Groupby sum
0 2 0 10 All 10.0
1 2 1 5 All 11.0
2 2 2 6 All 11.0
3 2 0 20 Type A 20.0
4 2 1 15 Type A 23.0
5 2 2 8 Type A 23.0
6 3 0 10 All 10.0
7 3 1 5 All 5.0
8 3 2 6 All 9.0
9 3 3 3 All 9.0
10 3 0 20 Type A 20.0
11 3 1 15 Type A 15.0
12 3 2 8 Type A 19.0
13 3 3 11 Type A 19.0
14 4 0 10 All 10.0
15 4 1 5 All 5.0
16 4 2 6 All 10.0
17 4 3 3 All 10.0
18 4 4 1 All 10.0
19 4 0 20 Type A 20.0
20 4 1 15 Type A 15.0
21 4 2 8 Type A 21.0
22 4 3 11 Type A 21.0
23 4 4 2 Type A 21.0
After np.select, the intermediate result is:
>>> df['sum']
0 NaN
1 5.0
2 6.0
3 NaN
4 15.0
5 8.0
6 NaN
7 NaN
8 6.0
9 3.0
10 NaN
11 NaN
12 8.0
13 11.0
14 NaN
15 NaN
16 6.0
17 3.0
18 1.0
19 NaN
20 NaN
21 8.0
22 11.0
23 2.0
Name: sum, dtype: float64
Answer 2
Score: 2
To avoid spelling out many conditions, create a dictionary that maps each A value to its B values, convert it to a DataFrame, and join it to df with DataFrame.merge using a right join and indicator=True. The indicator shows which rows matched the mapping, so GroupBy.transform can compute the sum per group, and numpy.where writes that sum to the new column only for matching rows (the other rows keep their Percent):

d = {2: [1, 2], 3: [2, 3], 4: [2, 3, 4]}
df = pd.DataFrame([(k, x) for k, v in d.items() for x in v],
                  columns=['A', 'B']).merge(df, how='right', indicator=True)
m = df.pop('_merge').eq('both')
df['sum'] = np.where(m,
                     df.groupby(['Groupby', 'A', m])['Percent'].transform('sum'),
                     df['Percent'])
print(df)
A B Percent Groupby sum
0 2 0 10 All 10
1 2 1 5 All 11
2 2 2 6 All 11
3 2 0 20 Type A 20
4 2 1 15 Type A 23
5 2 2 8 Type A 23
6 3 0 10 All 10
7 3 1 5 All 5
8 3 2 6 All 9
9 3 3 3 All 9
10 3 0 20 Type A 20
11 3 1 15 Type A 15
12 3 2 8 Type A 19
13 3 3 11 Type A 19
14 4 0 10 All 10
15 4 1 5 All 5
16 4 2 6 All 10
17 4 3 3 All 10
18 4 4 1 All 10
19 4 0 20 Type A 20
20 4 1 15 Type A 15
21 4 2 8 Type A 21
22 4 3 11 Type A 21
23 4 4 2 Type A 21
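The mapping above holds one list of B values per A, so the rows with B = 0 and B = 1 for A = 3 and A = 4 are left unsummed, unlike in the expected table of the question. One possible adaptation of the merge idea is sketched below, assuming the sample data from the question. The names `blocks`, `lookup`, and `block` are hypothetical, and this is not the code from the answer.

```python
import pandas as pd

# Sample data rebuilt from the question (same values, same row order).
percent = {"All": [10, 5, 6, 3, 1], "Type A": [20, 15, 8, 11, 2]}
df = pd.DataFrame(
    [(a, b, percent[grp][b], grp)
     for a in (2, 3, 4) for grp in ("All", "Type A") for b in range(a + 1)],
    columns=["A", "B", "Percent", "Groupby"],
)

# Assumed mapping: A -> blocks of B values whose Percent is summed together.
blocks = {2: [[1, 2]], 3: [[0, 1], [2, 3]], 4: [[0, 1], [2, 3, 4]]}

# One lookup row per (A, B) pair, tagged with the index of its block.
lookup = pd.DataFrame(
    [(a, b, i) for a, groups in blocks.items()
     for i, group in enumerate(groups) for b in group],
    columns=["A", "B", "block"],
)

# A left join keeps every row of df in its original order; unmatched rows get NaN.
out = df.merge(lookup, on=["A", "B"], how="left")
matched = out["block"].notna()

# Replace NaN with a placeholder so the grouping never sees missing keys.
block_sum = (out.assign(block=out["block"].fillna(-1))
                .groupby(["Groupby", "A", "block"])["Percent"]
                .transform("sum"))

# Matched rows take the block sum; unmatched rows keep their own Percent.
out["sum"] = block_sum.where(matched, out["Percent"])
print(out.drop(columns="block"))
```

Keeping the rules in a lookup table means that adding a new A value or another block only changes the `blocks` dictionary, not the grouping code.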