June 29, 2023, 16:17:43

Reduce by multiple columns in pandas groupby

Question

Given the DataFrame

import pandas as pd
df = pd.DataFrame(
    {
        "group0": [1, 1, 2, 2, 3, 3],
        "group1": ["1", "1", "1", "2", "2", "2"],
        "relevant": [True, False, False, True, True, True],
        "value": [0, 1, 2, 3, 4, 5],
    }
)

I wish to produce a target

target = pd.DataFrame(
    {
        "group0": [1, 2, 2, 3],
        "group1": ["1", "1", "2", "2"],
        "value": [0, 2, 3, 5],
    }
)

where "value" is chosen per group as follows:

The maximum of "value" over the rows where "relevant" is True
Otherwise, if the group has no rows where "relevant" is True, the maximum of "value" over all rows of the group

This would be produced by

def fun(x):
    tmp = x["value"][x["relevant"]]
    if len(tmp):
        return tmp.max()
    return x["value"].max()

if x were the sub-DataFrame of a single group.

Is it possible to achieve the desired groupby reduction efficiently?

EDIT:

With the following benchmark payload:

import numpy as np
import pandas as pd
from time import perf_counter

# Benchmark data: 10 million rows, at most 30 * 30 = 900 distinct groups.
df = pd.DataFrame(
    {
        "group0": np.random.randint(0, 30, size=10_000_000),
        "group1": np.random.randint(0, 30, size=10_000_000),
        # Note: randint's upper bound is exclusive, so randint(0, 1) is always 0
        # and "relevant" is all False here; randint(0, 2) would give a mix.
        "relevant": np.random.randint(0, 1, size=10_000_000).astype(bool),
        "value": np.random.random_sample(size=10_000_000) * 1000,
    }
)

# Approach 1: sort, then take the last row of each group.
start = perf_counter()
out = (df
   .sort_values(by=['relevant', 'value'])
   .groupby(['group0', 'group1'], as_index=False)
   ['value'].last()
 )
end = perf_counter()
print("Sort values", end - start)

# Approach 2: call a Python function once per group.
def fun(x):
    tmp = x["value"][x["relevant"]]
    if len(tmp):
        return tmp.max()
    return x["value"].max()

start = perf_counter()
out = df.groupby(["group0", "group1"]).apply(fun)
end = perf_counter()
print("Apply", end - start)
#Sort values 14.823943354000221
#Apply 1.5050544870009617

The .apply solution took 1.5 s, while the sort_values solution took 14.82 s. However, shrinking the groups (that is, spreading the same rows over many more distinct groups) with

...
        "group0": np.random.randint(0, 500_000, size=10_000_000),
        "group1": np.random.randint(0, 100_000, size=10_000_000),
...

made the sort_values solution vastly faster (15.29 s versus 1423.84 s). The sort_values solution by @mozway is therefore preferred, unless the user specifically knows that the data contains only a small number of groups.
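
Since .apply calls the Python function once per group, the number of groups is the figure to look at when choosing between the two. A minimal sketch of how one might check it before deciding; the data here is a scaled-down stand-in for the payload above, and the names n_groups and rows_per_group are just for illustration:

```python
import numpy as np
import pandas as pd

# Scaled-down stand-in for the benchmark data (assumption: 100,000 rows).
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame(
    {
        "group0": rng.integers(0, 30, size=n),
        "group1": rng.integers(0, 30, size=n),
    }
)

# Number of distinct (group0, group1) combinations actually present.
n_groups = df.groupby(["group0", "group1"]).ngroups

# Average group size; few large groups favour .apply less badly
# than many tiny groups do.
rows_per_group = len(df) / n_groups
print(n_groups, rows_per_group)
```

With key ranges of 500_000 and 100_000 on 10 million rows, almost every row would end up in its own group, which is the case where the per-group function call dominates.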

Answer 1

Score: 2

Yes, it is possible to achieve the desired groupby reduction. Your function can be used as it stands:

def fun(x):
    tmp = x["value"][x["relevant"]]
    if len(tmp):
        return tmp.max()
    return x["value"].max()

You get the desired output without modifying the function by passing it to groupby.apply and resetting the index:

target = df.groupby(["group0", "group1"]).apply(fun).reset_index(name='value')
print(target)

This gives the desired output:

   group0 group1  value
0       1      1      0
1       2      1      2
2       2      2      3
3       3      2      5
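
For anyone who wants to run this end to end, here is a self-contained sketch on the sample data from the question. Selecting the "relevant" and "value" columns before calling apply is my own addition, not part of the answer above; it keeps the grouping columns out of the frame passed to fun, which newer pandas versions warn about otherwise.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group0": [1, 1, 2, 2, 3, 3],
        "group1": ["1", "1", "1", "2", "2", "2"],
        "relevant": [True, False, False, True, True, True],
        "value": [0, 1, 2, 3, 4, 5],
    }
)

def fun(x):
    # Maximum over the relevant rows, if the group has any.
    tmp = x["value"][x["relevant"]]
    if len(tmp):
        return tmp.max()
    # Otherwise fall back to the maximum over the whole group.
    return x["value"].max()

keys = ["group0", "group1"]
result = (
    df.groupby(keys)[["relevant", "value"]]  # only the columns fun needs
    .apply(fun)                              # one scalar per group
    .reset_index(name="value")               # keys back to columns
)
print(result)
```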

Answer 2

Score: 2

Sort the values so that rows with relevant == True come last and, within those, the highest value comes last, then use groupby.last:

out = (df
   .sort_values(by=['relevant', 'value'])
   .groupby(['group0', 'group1'], as_index=False)
   ['value'].last()
 )

Output:

   group0 group1  value
0       1      1      0
1       2      1      2
2       2      2      3
3       3      2      5

Intermediate result after sorting, before aggregation:

* marks the selected rows

   group0 group1  relevant  value
1       1      1     False      1
2       2      1     False      2  *
0       1      1      True      0  *
3       2      2      True      3  *
4       3      2      True      4
5       3      2      True      5  *
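
This works because False sorts before True, so within each group the last row after sorting is the largest relevant value if one exists, and the largest value overall if not. To check that beyond the six-row example, one could compare it with the apply-based reduction on random data. This is only a sketch: the data size, the seed and the roughly 10% share of relevant rows are my own assumptions, chosen so that some groups have no relevant rows at all.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame(
    {
        "group0": rng.integers(0, 30, size=n),
        "group1": rng.integers(0, 30, size=n),
        "relevant": rng.random(n) < 0.1,  # sparse, so the fallback is exercised
        "value": rng.random(n) * 1000,
    }
)
keys = ["group0", "group1"]

def fun(x):
    tmp = x["value"][x["relevant"]]
    if len(tmp):
        return tmp.max()
    return x["value"].max()

# Reference result: one Python call per group.
via_apply = (
    df.groupby(keys)[["relevant", "value"]]
    .apply(fun)
    .reset_index(name="value")
)

# Sort-based result from this answer.
via_sort = (
    df.sort_values(by=["relevant", "value"])
    .groupby(keys, as_index=False)["value"]
    .last()
)

# Raises an AssertionError if the two results differ.
pd.testing.assert_frame_equal(via_apply, via_sort, check_dtype=False)
```

Note that groupby.last skips missing values, so the comparison assumes "value" contains no NaN.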

Answer 3

Score: 2

Probably not as elegant as other solutions, but also works:

group_keys = ["group0", "group1"]
rev = df[df.relevant].groupby(group_keys).max()
nrev = df[~df.relevant].groupby(group_keys).max()
merged = rev.merge(nrev, on=group_keys, how='outer')
merged['value'] = merged.value_x.where(merged.relevant_x, merged.value_y)
merged
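
The merged frame keeps the helper columns and carries the group keys in its index. A possible variant of the same idea that lands directly on the target shape is sketched below; it is not the code from this answer. It takes the per-group maximum over the relevant rows and, for groups that have none, falls back to the per-group maximum over all rows (which for those groups is the same as the maximum over the non-relevant rows). It assumes "value" contains no NaN.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group0": [1, 1, 2, 2, 3, 3],
        "group1": ["1", "1", "1", "2", "2", "2"],
        "relevant": [True, False, False, True, True, True],
        "value": [0, 1, 2, 3, 4, 5],
    }
)
group_keys = ["group0", "group1"]

# Per-group maximum over relevant rows only; groups without any are absent.
max_relevant = df[df["relevant"]].groupby(group_keys)["value"].max()

# Per-group maximum over all rows; this index contains every group.
max_all = df.groupby(group_keys)["value"].max()

result = (
    max_relevant.reindex(max_all.index)  # absent groups become NaN
    .fillna(max_all)                     # ...and are filled from the fallback
    .astype(df["value"].dtype)           # undo the float upcast caused by NaN
    .reset_index()
)
print(result)
```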
