Apply filter for groupby aggregate function in Python Pandas
Question
How can I apply a filter to a groupby aggregate function in pandas?
I have a DataFrame:
import pandas as pd

data = {'Fruit': ['apple', 'apple', 'apple', 'kivi', 'kivi', 'kivi'],
        'Y_or_N': ['Y', 'N', 'Y', 'N', 'N', 'Y'],
        'A_or_B': ['A', 'A', 'B', 'A', 'B', 'A'],
        'Number': [3, 5, 6, 7, 2, 4]}
df = pd.DataFrame.from_dict(data)
For each fruit group, I want to sum the Number values into three columns: (1) all values, (2) rows where 'Y_or_N' == 'Y', (3) rows where 'A_or_B' == 'A'.
I tried the following:
new_df = df.groupby(['Fruit']).apply(lambda x: x[x['Y_or_N'] == 'Y' ].agg(sum_Y=('Number', 'sum')))
This works, but only for one column. Is there a more efficient way to apply different filters and aggregate functions to different columns, without building three DataFrames and merging them?
Desired output:
| Fruit | sum_all | sum_Y | sum_A | 
|---|---|---|---|
| apple | 14 | 9 | 8 | 
| kivi | 13 | 4 | 11 | 
Answer 1
Score: 3
I would first rework the columns, then aggregate:
(df.assign(sum_Y=lambda d: d['Number'].where(d['Y_or_N'].eq('Y')),
           sum_A=lambda d: d['Number'].where(d['A_or_B'].eq('A')),
          )
   .rename(columns={'Number': 'sum_all'})
   .groupby('Fruit', as_index=False)[['sum_all', 'sum_Y', 'sum_A']].sum()
)
Output:
   Fruit  sum_all  sum_Y  sum_A
0  apple       14    9.0    8.0
1   kivi       13    4.0   11.0
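The sums above come out as floats because where fills the filtered-out rows with NaN. If integer results are preferred, one possible variation (a sketch, not part of the original answer) is to pass 0 as the replacement value, so excluded rows contribute zero instead. This assumes Number itself contains no missing values.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['apple', 'apple', 'apple', 'kivi', 'kivi', 'kivi'],
    'Y_or_N': ['Y', 'N', 'Y', 'N', 'N', 'Y'],
    'A_or_B': ['A', 'A', 'B', 'A', 'B', 'A'],
    'Number': [3, 5, 6, 7, 2, 4],
})

out = (
    df.assign(
        # Rows failing the condition become 0 rather than NaN,
        # so the column keeps its integer dtype.
        sum_Y=lambda d: d['Number'].where(d['Y_or_N'].eq('Y'), 0),
        sum_A=lambda d: d['Number'].where(d['A_or_B'].eq('A'), 0),
    )
    .rename(columns={'Number': 'sum_all'})
    .groupby('Fruit', as_index=False)[['sum_all', 'sum_Y', 'sum_A']].sum()
)
print(out)
```

The trade-off is that a group with no matching rows reports 0, which is also what summing an all-NaN group gives by default.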
Answer 2
Score: 1
Here are three ways to do it:
Way #1:
res = (df
    .Number.pipe(lambda s: pd.DataFrame({
        'Fruit': df.Fruit,
        'sum_all': s,
        'sum_Y': s[df.Y_or_N.eq('Y')],
        'sum_A': s[df.A_or_B.eq('A')]}))
    .groupby('Fruit', as_index=False).sum().convert_dtypes())
Way #2:
res = pd.DataFrame({
    'sum_all': df.groupby('Fruit').Number.sum(),
    'sum_Y': df[df.Y_or_N.eq('Y')].groupby('Fruit').Number.sum(),
    'sum_A': df[df.A_or_B.eq('A')].groupby('Fruit').Number.sum()}).reset_index()
Way #3: This is a variation on the excellent answer by @mozway, with the following tweaks:
- factors out the common Number column access into a Series that is piped into a lambda
- uses convert_dtypes to get back to int for the sums of the filtered columns, where NaN caused an upcast to float
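res = (df.Number.pipe(lambda s: df
    .assign(sum_Y=lambda d: s[d.Y_or_N.eq('Y')], sum_A=lambda d: s[d.A_or_B.eq('A')]))
    .rename(columns={'Number': 'sum_all'})
    # select the numeric columns so sum() does not also aggregate the Y_or_N / A_or_B strings
    .groupby('Fruit', as_index=False)[['sum_all', 'sum_Y', 'sum_A']].sum().convert_dtypes()
)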
Output:
   Fruit  sum_all  sum_Y  sum_A
0  apple       14      9      8
1   kivi       13      4     11
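Ways #1 and #3 rely on index alignment: s[mask] keeps only the matching rows, and when that shorter Series sits next to full-length columns, pandas aligns it on the index and fills the gaps with NaN. A small sketch of that behaviour using the question's data follows; the names filtered, aligned and restored are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['apple', 'apple', 'apple', 'kivi', 'kivi', 'kivi'],
    'Y_or_N': ['Y', 'N', 'Y', 'N', 'N', 'Y'],
    'A_or_B': ['A', 'A', 'B', 'A', 'B', 'A'],
    'Number': [3, 5, 6, 7, 2, 4],
})

s = df['Number']
filtered = s[df['Y_or_N'].eq('Y')]  # keeps only the rows labelled 0, 2 and 5

# Building a frame from Series of different lengths aligns them on the index;
# the rows missing from `filtered` become NaN, which upcasts sum_Y to float64.
aligned = pd.DataFrame({'sum_all': s, 'sum_Y': filtered})
print(aligned)

# NaN is skipped by sum(); convert_dtypes() then restores a (nullable) integer dtype.
restored = aligned.groupby(df['Fruit']).sum().convert_dtypes()
print(restored.dtypes)
```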
Answer 3
Score: 1
import pandas as pd
data = {'Fruit': ['apple', 'apple', 'apple', 'kivi', 'kivi', 'kivi'],
        'Y_or_N': ['Y', 'N', 'Y', 'N', 'N', 'Y'],
        'A_or_B': ['A', 'A', 'B', 'A', 'B', 'A'],
        'Number': [3, 5, 6, 7, 2, 4]}
df = pd.DataFrame.from_dict(data)
r1 = df.groupby(['Fruit'])['Number'].sum()
r2 = df.groupby(['Fruit']).apply(lambda d: d[d['Y_or_N'].eq('Y')]['Number'].sum())
r3 = df.groupby(['Fruit']).apply(lambda d: d[d['A_or_B'].eq('A')]['Number'].sum())
r = pd.concat([r1, r2, r3], axis=1).set_axis(['Sum_All', 'Sum_Y', 'Sum_A'], axis='columns')
print(r)
       Sum_All  Sum_Y  Sum_A
Fruit                       
apple       14      9      8
kivi        13      4     11
Answer 4
Score: 1
Another option, using DataFrame.pivot:
res_df = df.pivot(index='Fruit', columns=['Y_or_N', 'A_or_B'], values='Number')
res_df = pd.concat([res_df.sum(1).to_frame('sum_all'),
                    res_df.xs('Y', axis=1).sum(1).to_frame('sum_Y'),
                    res_df.xs('A', level=1, axis=1).sum(1).to_frame('sum_A')], axis=1).reset_index()
   Fruit  sum_all  sum_Y  sum_A
0  apple     14.0    9.0    8.0
1   kivi     13.0    4.0   11.0
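Note that DataFrame.pivot raises a ValueError when an index/column combination occurs more than once, which the question's data happens to avoid. If a fruit could repeat a (Y_or_N, A_or_B) pair, one possible adaptation is pivot_table with aggfunc='sum'. In the sketch below, the extra row is hypothetical and added only to create such a duplicate.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['apple', 'apple', 'apple', 'kivi', 'kivi', 'kivi'],
    'Y_or_N': ['Y', 'N', 'Y', 'N', 'N', 'Y'],
    'A_or_B': ['A', 'A', 'B', 'A', 'B', 'A'],
    'Number': [3, 5, 6, 7, 2, 4],
})

# Hypothetical extra row: a second ('apple', 'Y', 'A') entry, which df.pivot would reject.
extra = pd.DataFrame({'Fruit': ['apple'], 'Y_or_N': ['Y'], 'A_or_B': ['A'], 'Number': [10]})
dup = pd.concat([df, extra], ignore_index=True)

# pivot_table aggregates duplicate cells instead of raising.
wide = dup.pivot_table(index='Fruit', columns=['Y_or_N', 'A_or_B'],
                       values='Number', aggfunc='sum')

res_df = pd.concat([
    wide.sum(axis=1).to_frame('sum_all'),
    wide.xs('Y', axis=1).sum(axis=1).to_frame('sum_Y'),           # first column level == 'Y'
    wide.xs('A', level=1, axis=1).sum(axis=1).to_frame('sum_A'),  # second column level == 'A'
], axis=1).reset_index()
print(res_df)
```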