在Python Pandas中为groupby聚合函数应用筛选器

huangapple go评论109阅读模式
英文:

Apply filter for groupby aggregate function in Python Pandas

问题

如何在Pandas中为groupby聚合函数应用过滤器?

我有一个DataFrame

  1. data = {'Fruit':['apple', 'apple', 'apple', 'kivi', 'kivi', 'kivi'],
  2. 'Y_or_N': ['Y', 'N', 'Y', 'N', 'N', 'Y'],
  3. 'A_or_B': ['A', 'A', 'B', 'A', 'B', 'A'],
  4. 'Number': [3, 5, 6, 7, 2, 4]}
  5. df = pd.DataFrame.from_dict(data)

我想对每个水果组在3列中求和Number值:(1) 所有值,(2) 其中'Y_or_N'=='Y',(3) 其中'A_or_B'=='A'。

我尝试了以下方法:

  1. new_df = df.groupby(['Fruit']).apply(lambda x: x[x['Y_or_N'] == 'Y' ].agg(sum_Y=('Number', 'sum')))

这个方法有效,但仅适用于一个列。是否有更有效的方法来为不同列应用不同的过滤器和聚合函数?而不是创建3个数据框然后将它们合并在一起。

期望的输出:

Fruit sum_all sum_Y sum_A
apple 14 9 8
kivi 13 4 11
英文:

How to apply filter for groupby aggregate function in Pandas?

I have DataFrame

  1. data = {'Fruit':['apple', 'apple', 'apple', 'kivi', 'kivi', 'kivi'],
  2. 'Y_or_N': ['Y', 'N', 'Y', 'N', 'N', 'Y'],
  3. 'A_or_B': ['A', 'A', 'B', 'A', 'B', 'A'],
  4. 'Number': [3, 5, 6, 7, 2, 4]}
  5. df = pd.DataFrame.from_dict(data)

I want for each fruit group sum Number values in 3 columns: (1) all values, (2) where 'Y_or_N'=='Y', (3) where 'A_or_B'=='A'.

I have tried the following:

  1. new_df = df.groupby(['Fruit']).apply(lambda x: x[x['Y_or_N'] == 'Y' ].agg(sum_Y=('Number', 'sum')))

This works, but only for 1 column. Is there a more efficient way to apply different filters for different columns and aggregate functions? Without making 3 df and then merging them together.

Desired output:

Fruit sum_all sum_Y sum_A
apple 14 9 8
kivi 13 4 11

答案1

得分: 3

我会首先重新设计列,然后进行汇总:

  1. (df.assign(sum_Y=lambda d: d['Number'].where(d['Y_or_N'].eq('Y')),
  2. sum_A=lambda d: d['Number'].where(d['A_or_B'].eq('A')),
  3. )
  4. .rename(columns={'Number': 'sum_all'})
  5. .groupby('Fruit', as_index=False)[['sum_all', 'sum_Y', 'sum_A']].sum()
  6. )

输出:

  1. Fruit sum_all sum_Y sum_A
  2. 0 apple 14 9.0 8.0
  3. 1 kivi 13 4.0 11.0
英文:

I would first rework the columns, then aggregate:

  1. (df.assign(sum_Y=lambda d: d['Number'].where(d['Y_or_N'].eq('Y')),
  2. sum_A=lambda d: d['Number'].where(d['A_or_B'].eq('A')),
  3. )
  4. .rename(columns={'Number': 'sum_all'})
  5. .groupby('Fruit', as_index=False)[['sum_all', 'sum_Y', 'sum_A']].sum()
  6. )

Output:

  1. Fruit sum_all sum_Y sum_A
  2. 0 apple 14 9.0 8.0
  3. 1 kivi 13 4.0 11.0

答案2

得分: 1

这是三种方法可以实现它:

方法 #1:

  1. res = (df
  2. .Number.pipe(lambda s: pd.DataFrame({
  3. 'Fruit': df.Fruit,
  4. 'sum_all': s,
  5. 'sum_Y': s[df.Y_or_N.eq('Y')],
  6. 'sum_A': s[df.A_or_B.eq('A')]}))
  7. .groupby('Fruit', as_index=False).sum().convert_dtypes())

方法 #2:

  1. res = pd.DataFrame({
  2. 'sum_all': df.groupby('Fruit').Number.sum(),
  3. 'sum_Y': df[df.Y_or_N.eq('Y')].groupby('Fruit').Number.sum(),
  4. 'sum_A': df[df.A_or_B.eq('A')].groupby('Fruit').Number.sum()}).reset_index()

方法 #3:这是基于 @mozway 出色答案的一种变体,具有以下调整:

  • 将常见的 Number 列访问提取为一个 Series,然后通过管道传递到 lambda 函数
  • 使用 convert_dtypes 将筛选列的总和转换回整数,其中 NaN 导致浮点数的升级
  1. res = (df.Number.pipe(lambda s: df
  2. .assign(sum_Y=lambda d: s[d.Y_or_N.eq('Y')], sum_A=lambda d: s[d.A_or_B.eq('A')]))
  3. .rename(columns={'Number': 'sum_all'})
  4. .groupby('Fruit', as_index=False).sum().convert_dtypes()
  5. )

输出:

  1. Fruit sum_all sum_Y sum_A
  2. 0 apple 14 9 8
  3. 1 kivi 13 4 11
英文:

Here's are three ways you can do it:

Way #1:

  1. res = ( df
  2. .Number.pipe(lambda s: pd.DataFrame({
  3. 'Fruit':df.Fruit,
  4. 'sum_all':s,
  5. 'sum_Y':s[df.Y_or_N.eq('Y')],
  6. 'sum_A':s[df.A_or_B.eq('A')]}))
  7. .groupby('Fruit', as_index=False).sum().convert_dtypes() )

Way #2:

  1. res = pd.DataFrame({
  2. 'sum_all':df.groupby('Fruit').Number.sum(),
  3. 'sum_Y':df[df.Y_or_N.eq('Y')].groupby('Fruit').Number.sum(),
  4. 'sum_A':df[df.A_or_B.eq('A')].groupby('Fruit').Number.sum()}).reset_index()

Way #3: This is a variation on the excellent answer by @mozway with the following tweaks:

  • factors out the common Number column access into a Series we pipe into a lambda
  • uses convert_dtypes to get back to int for the sums of filtered columns where NaN caused an upcast to float
  1. res = (df.Number.pipe(lambda s: df
  2. .assign(sum_Y=lambda d: s[d.Y_or_N.eq('Y')], sum_A=lambda d: s[d.A_or_B.eq('A')]))
  3. .rename(columns={'Number': 'sum_all'})
  4. .groupby('Fruit', as_index=False).sum().convert_dtypes()
  5. )

Output:

  1. Fruit sum_all sum_Y sum_A
  2. 0 apple 14 9 8
  3. 1 kivi 13 4 11

答案3

得分: 1

  1. import pandas as pd
  2. data = {'Fruit': ['apple', 'apple', 'apple', 'kivi', 'kivi', 'kivi'],
  3. 'Y_or_N': ['Y', 'N', 'Y', 'N', 'N', 'Y'],
  4. 'A_or_B': ['A', 'A', 'B', 'A', 'B', 'A'],
  5. 'Number': [3, 5, 6, 7, 2, 4]}
  6. df = pd.DataFrame.from_dict(data)
  7. r1 = df.groupby(['Fruit'])['Number'].sum()
  8. r2 = df.groupby(['Fruit']).apply(lambda d: d[d['Y_or_N'].eq('Y')]['Number'].sum())
  9. r3 = df.groupby(['Fruit']).apply(lambda d: d[d['A_or_B'].eq('A')]['Number'].sum())
  10. r = pd.concat([r1, r2, r3], axis=1).set_axis(['Sum_All', 'Sum_Y', 'Sum_A'], axis='columns')
  11. print(r)
英文:
  1. import pandas as pd
  2. data = {'Fruit':['apple', 'apple', 'apple', 'kivi', 'kivi', 'kivi'],
  3. 'Y_or_N': ['Y', 'N', 'Y', 'N', 'N', 'Y'],
  4. 'A_or_B': ['A', 'A', 'B', 'A', 'B', 'A'],
  5. 'Number': [3, 5, 6, 7, 2, 4]}
  6. df = pd.DataFrame.from_dict(data)
  7. r1 = df.groupby(['Fruit'])['Number'].sum()
  8. r2 = df.groupby(['Fruit']).apply(lambda d: d[d['Y_or_N'].eq('Y')]['Number'].sum())
  9. r3 = df.groupby(['Fruit']).apply(lambda d: d[d['A_or_B'].eq('A')]['Number'].sum())
  10. r = pd.concat([r1, r2, r3], axis=1).set_axis(['Sum_All', 'Sum_Y', 'Sum_A'], axis='columns')
  11. print(r)
  1. Sum_All Sum_Y Sum_A
  2. Fruit
  3. apple 14 9 8
  4. kivi 13 4 11

答案4

得分: 1

另一种使用 pd.pivot 的选项:

  1. res_df = df.pivot(index='Fruit', columns=['Y_or_N', 'A_or_B'], values='Number')
  2. res_df = pd.concat([res_df.sum(1).to_frame('sum_all'),
  3. res_df.xs('Y', axis=1).sum(1).to_frame('sum_Y'),
  4. res_df.xs('A', level=1, axis=1).sum(1).to_frame('sum_A')], axis=1).reset_index()

  1. Fruit sum_all sum_Y sum_A
  2. 0 apple 14.0 9.0 8.0
  3. 1 kivi 13.0 4.0 11.0
英文:

Another option with pd.pivot:

  1. res_df = df.pivot(index='Fruit', columns=['Y_or_N', 'A_or_B'], values='Number')
  2. res_df = pd.concat([res_df.sum(1).to_frame('sum_all'),
  3. res_df.xs('Y', axis=1).sum(1).to_frame('sum_Y'),
  4. res_df.xs('A', level=1, axis=1).sum(1).to_frame('sum_A')], axis=1).reset_index()

  1. Fruit sum_all sum_Y sum_A
  2. 0 apple 14.0 9.0 8.0
  3. 1 kivi 13.0 4.0 11.0

huangapple
  • 本文由 发表于 2023年3月7日 22:55:12
  • 转载请务必保留本文链接:https://go.coder-hub.com/75663579.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定