May 28, 2023 23:37:19

Pandas groupby all rows greater than _each_ value in column

Question

I'm having trouble even wording the question, honestly, but it goes roughly like this: I need to perform a groupby that applies an aggregate function not for all rows equal to each value in the grouping column, but for all rows where the value of the grouping column is greater than each value in the column.

I'll be purposefully verbose, just in case. Let's say I have this dataframe:

import pandas as pd
df = pd.DataFrame({
    'Weight': [40, 50, 60, 70, 40, 60, 80, 100, 60, 40, 50, 70, 60],
    'Height': [150, 160, 170, 180, 190, 160, 150, 180, 170, 200, 210, 160, 180]
})

On which I perform a simple groupby to find a mean value:

gb = df.groupby(['Weight'])['Height'].mean().reset_index()
gb

Which returns

   Weight  Height
0      40     180
1      50     185
2      60     170
3      70     170
4      80     150
5     100     180

But I need something like this (not real code obviously):

gb = df.groupby(["""a dynamic column or a mask of some sort that marks all rows where 'Weight' is greater than each of its unique values"""])['Height'].mean().reset_index()
gb

I can, of course, achieve the desired result by iterating over each unique value in the column like this:

res_list = []
for w in sorted(df['Weight'].unique().tolist()):
    res_list.append(df[df['Weight'] > w]['Height'].mean())
gb = pd.DataFrame({'Weight': sorted(df['Weight'].unique().tolist()), 'Height': res_list})
gb

Which will return exactly what I want:

   Weight  Height
0      40  172.00
1      50  168.75
2      60  167.50
3      70  165.00
4      80  180.00
5     100     NaN

But this method scales very poorly with the number of unique values and the number of columns that I need to perform this operation on.

I'm not married to groupby() specifically, but I have a feeling that there's a way to do it, and I just lack the googling skills to find the answer.

Answer 1

Score: 3

The mean is just the sum divided by the count. Both the sum and the count can be accumulated in your case:

import numpy as np
import pandas as pd

# Let's add another column for fun
df = pd.DataFrame(
    {
        "Weight": [40, 50, 60, 70, 40, 60, 80, 100, 60, 40, 50, 70, 60],
        "Height": [150, 160, 170, 180, 190, 160, 150, 180, 170, 200, 210, 160, 180],
        "Age": np.random.randint(20, 100, 13),
    }
)
# For each Weight class, sum up Height and Age, and count how many rows
# belong to that class. The [::-1] reverses the rows so that Weight is
# sorted from highest to lowest
tmp = (
    df.groupby("Weight")[["Height", "Age"]]
    .agg(["sum", "count"])
    .swaplevel(axis=1)[::-1]
)
# Take the sum over all rows with Weight higher than the current Weight
tmp = tmp.cumsum() - tmp
# The mean is just the sum divided by the count
(tmp["sum"] / tmp["count"])[::-1]
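
To check the cumulative approach against the numbers in the question, here is a minimal sketch that wraps the same idea in a helper and runs it on the original two-column frame. The function name `mean_above` and its parameters are my own for illustration and are not part of pandas; it also uses `sort_index` and `xs` in place of `[::-1]` and `swaplevel`, which should be equivalent here.

```python
import numpy as np
import pandas as pd


def mean_above(df, key, value_cols):
    """Mean of value_cols over rows whose `key` is strictly greater
    than each unique value of `key` (hypothetical helper)."""
    # Per-key sums and counts, ordered from the highest key to the lowest.
    stats = (
        df.groupby(key)[value_cols]
        .agg(["sum", "count"])
        .sort_index(ascending=False)
    )
    # Exclusive running totals: everything strictly above the current key.
    above = stats.cumsum() - stats
    sums = above.xs("sum", axis=1, level=1)
    counts = above.xs("count", axis=1, level=1)
    # 0 / 0 gives NaN for the largest key, which has no rows above it.
    return (sums / counts).sort_index()


df = pd.DataFrame({
    "Weight": [40, 50, 60, 70, 40, 60, 80, 100, 60, 40, 50, 70, 60],
    "Height": [150, 160, 170, 180, 190, 160, 150, 180, 170, 200, 210, 160, 180],
})
result = mean_above(df, "Weight", ["Height"])
print(result)
```

On this data the `Height` column should match the loop-based table from the question (172.00, 168.75, 167.50, 165.00, 180.00, NaN), with `Weight` as the index rather than a regular column. The grouping runs once regardless of how many unique weights there are, and extra columns can be added to `value_cols` without repeating the filter per value.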
