Pandas groupby all rows greater than _each_ value in column

Question

I'm having trouble even wording the question, honestly, but it goes roughly like this: I need to perform a groupby that applies an aggregate function not for all rows equal to each value in the grouping column, but for all rows where the value of the grouping column is greater than each value in the column.

I'll be purposefully verbose, just in case. Let's say I have this dataframe:

import pandas as pd
df = pd.DataFrame({
    'Weight': [40, 50, 60, 70, 40, 60, 80, 100, 60, 40, 50, 70, 60],
    'Height': [150, 160, 170, 180, 190, 160, 150, 180, 170, 200, 210, 160, 180]
    })

On which I perform a simple groupby to find a mean value:

gb = df.groupby(['Weight'])['Height'].mean().reset_index()
gb

Which returns:

   Weight  Height
0      40     180
1      50     185
2      60     170
3      70     170
4      80     150
5     100     180

But I need something like this (not real code obviously):

gb = df.groupby(["""a dynamic column or a mask of some sort that marks all rows where 'Weight' is greater than each of its unique values"""])['Height'].mean().reset_index()
gb

I can, of course, achieve the desired result by iterating over each unique value in the column like this:

res_list = []
for w in sorted(df['Weight'].unique().tolist()):
    res_list.append(df[df['Weight'] > w]['Height'].mean())
gb = pd.DataFrame({'Weight': sorted(df['Weight'].unique().tolist()), 'Height': res_list})
gb

Which will return exactly what I want:

   Weight  Height
0      40  172.00
1      50  168.75
2      60  167.50
3      70  165.00
4      80  180.00
5     100     NaN

But this method scales very poorly with the number of unique values and the number of columns that I need to perform this operation on.

I'm not married to groupby() specifically, but I have a feeling that there's a way to do it, and I just lack the googling skills to find the answer.

Answer 1

Score: 3

Mean is just sum divided by count. Both sum and count can be accumulated in your case:

import numpy as np
import pandas as pd

# Let's add another column for fun
df = pd.DataFrame(
    {
        "Weight": [40, 50, 60, 70, 40, 60, 80, 100, 60, 40, 50, 70, 60],
        "Height": [150, 160, 170, 180, 190, 160, 150, 180, 170, 200, 210, 160, 180],
        "Age": np.random.randint(20, 100, 13),
    }
)

# For each Weight class, sum up Height and Age, and count how many rows
# belong to that class. The [::-1] reverses the rows so that the Weight
# values are sorted from highest to lowest
tmp = (
    df.groupby("Weight")[["Height", "Age"]]
    .agg(["sum", "count"])
    .swaplevel(axis=1)[::-1]
)

# Take the sum of all rows with Weight higher than the current Weight
tmp = tmp.cumsum() - tmp

# The mean is just the sum divided by count
(tmp["sum"] / tmp["count"])[::-1]
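
As a sanity check, the same idea can be wrapped in a small helper and compared against the loop from the question. This is only a sketch: the function name `mean_of_strictly_greater` and its parameters are made up for illustration, and it uses `sort_index` instead of `[::-1]` and `swaplevel`, which should be equivalent here.

```python
import numpy as np
import pandas as pd


def mean_of_strictly_greater(df, key, value_cols):
    """For each unique `key`, average `value_cols` over rows with a strictly greater key.

    Hypothetical helper for illustration, not part of pandas.
    """
    # Per-key sums and counts, ordered from the largest key to the smallest.
    grouped = df.groupby(key)[value_cols]
    sums = grouped.sum().sort_index(ascending=False)
    counts = grouped.count().sort_index(ascending=False)

    # Exclusive running totals: everything accumulated so far minus the current group.
    sums_above = sums.cumsum() - sums
    counts_above = counts.cumsum() - counts

    # 0 / 0 for the largest key gives NaN, like the loop-based result.
    return (sums_above / counts_above).sort_index()


df = pd.DataFrame({
    "Weight": [40, 50, 60, 70, 40, 60, 80, 100, 60, 40, 50, 70, 60],
    "Height": [150, 160, 170, 180, 190, 160, 150, 180, 170, 200, 210, 160, 180],
})

fast = mean_of_strictly_greater(df, "Weight", ["Height"])

# Reference result from the straightforward loop in the question.
weights = sorted(df["Weight"].unique())
slow = pd.Series(
    [df.loc[df["Weight"] > w, "Height"].mean() for w in weights],
    index=weights,
)

print(fast)
```

The helper does a single groupby plus a cumulative sum, so the cost should no longer grow with one full filter pass per unique value. The trick relies on the aggregate being decomposable into running totals (sum and count here); something like a median would need a different approach.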

huangapple
  • Posted on May 28, 2023, 23:37:19
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76352264.html