Reduce by multiple columns in pandas groupby

Question

Given the DataFrame

import pandas as pd

df = pd.DataFrame(
    {
        "group0": [1, 1, 2, 2, 3, 3],
        "group1": ["1", "1", "1", "2", "2", "2"],
        "relevant": [True, False, False, True, True, True],
        "value": [0, 1, 2, 3, 4, 5],
    }
)

I wish to produce a target

target = pd.DataFrame(
    {
        "group0": [1, 2, 2, 3],
        "group1": ["1", "1", "2", "2"],
        "value": [0, 2, 3, 5],
    }
)

where "value" is chosen per group as

  1. the maximum of "value" over the rows where "relevant" is True, or
  2. the maximum of "value" over all rows, if the group has no row where "relevant" is True.

This would be produced by

def fun(x):
    tmp = x["value"][x["relevant"]]
    if len(tmp):
        return tmp.max()
    return x["value"].max()

where x is the sub-DataFrame of a single group.

Is it possible to achieve the desired groupby reduction efficiently?

EDIT:

With a larger payload:

from time import perf_counter

import numpy as np
import pandas as pd

df = pd.DataFrame(
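    {
        "group0": np.random.randint(0, 30, size=10_000_000),
        "group1": np.random.randint(0, 30, size=10_000_000),
        # The upper bound of randint is exclusive: randint(0, 1) only yields 0
        # (all False), so use 2 to get a mix of True and False.
        "relevant": np.random.randint(0, 2, size=10_000_000).astype(bool),
        "value": np.random.random_sample(size=10_000_000) * 1000,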
    }
)

start = perf_counter()
out = (df
   .sort_values(by=['relevant', 'value'])
   .groupby(['group0', 'group1'], as_index=False)
   ['value'].last()
 )
end = perf_counter()
print("Sort values", end - start)

def fun(x):
    tmp = x["value"][x["relevant"]]
    if len(tmp):
        return tmp.max()
    return x["value"].max()

start = perf_counter()
out = df.groupby(["group0", "group1"]).apply(fun)
end = perf_counter()
print("Apply", end - start)
# Sort values 14.823943354000221
# Apply 1.5050544870009617

The .apply solution took 1.5 s, while the sort_values solution took 14.82 s. However, increasing the number of groups (and so shrinking each group) with

...
        "group0": np.random.randint(0, 500_000, size=10_000_000),
        "group1": np.random.randint(0, 100_000, size=10_000_000),
...

made the sort_values solution vastly faster (15.29 s versus 1423.84 s). The sort_values solution by @mozway is therefore preferable, unless you know that the data contains only a small number of groups.
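The dependence on the number of groups can be explored with a small harness like the one below. This is only a sketch: make_frame, by_sort and by_apply are hypothetical helper names, the row count is deliberately tiny so that it runs in a moment, and the absolute timings will differ from the ones quoted above.

```python
from time import perf_counter

import numpy as np
import pandas as pd


def make_frame(n_rows, n_keys, seed=0):
    # Hypothetical helper: random data with n_keys distinct values per key column.
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {
            "group0": rng.integers(0, n_keys, size=n_rows),
            "group1": rng.integers(0, n_keys, size=n_rows),
            "relevant": rng.integers(0, 2, size=n_rows).astype(bool),
            "value": rng.random(n_rows) * 1000,
        }
    )


def by_sort(df):
    # Sort so that relevant rows, then larger values, come last in each group.
    return (
        df.sort_values(by=["relevant", "value"])
        .groupby(["group0", "group1"])["value"]
        .last()
    )


def by_apply(df):
    def fun(x):
        tmp = x["value"][x["relevant"]]
        return tmp.max() if len(tmp) else x["value"].max()

    # Selecting the two columns keeps the grouping keys out of each sub-frame.
    return df.groupby(["group0", "group1"])[["relevant", "value"]].apply(fun)


timings = {}
for n_keys in (3, 50):  # at most 9 groups versus at most 2,500 groups
    frame = make_frame(20_000, n_keys)
    for name, func in (("sort_values", by_sort), ("apply", by_apply)):
        start = perf_counter()
        func(frame)
        timings[(n_keys, name)] = perf_counter() - start
        print(f"n_keys={n_keys:>2} {name:<11} {timings[(n_keys, name)]:.4f}s")
```

The cost of .apply grows with the number of groups because the Python function is called once per group, whereas the sort is dominated by the number of rows.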

Answer 1

Score: 2

Yes, it is possible to achieve the desired groupby reduction efficiently. Your function can be used as written:

def fun(x):
    tmp = x["value"][x["relevant"]]
    if len(tmp):
        return tmp.max()
    return x["value"].max()

You can get the desired output without modifying the function by passing it to groupby(...).apply:

target = df.groupby(["group0", "group1"]).apply(fun).reset_index(name='value')
print(target)

Here is the desired output:

   group0 group1  value
0       1      1      0
1       2      1      2
2       2      2      3
3       3      2      5
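As a self-contained check, the sketch below runs the same apply call on the sample data from the question and compares it with the expected target. Selecting the "relevant" and "value" columns before .apply is my own addition, assumed here to avoid the warning that recent pandas versions emit when the grouping columns are passed to the function; the result is otherwise the same.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group0": [1, 1, 2, 2, 3, 3],
        "group1": ["1", "1", "1", "2", "2", "2"],
        "relevant": [True, False, False, True, True, True],
        "value": [0, 1, 2, 3, 4, 5],
    }
)
target = pd.DataFrame(
    {
        "group0": [1, 2, 2, 3],
        "group1": ["1", "1", "2", "2"],
        "value": [0, 2, 3, 5],
    }
)


def fun(x):
    # Prefer the maximum over relevant rows; fall back to the overall maximum.
    tmp = x["value"][x["relevant"]]
    if len(tmp):
        return tmp.max()
    return x["value"].max()


result = (
    df.groupby(["group0", "group1"])[["relevant", "value"]]
    .apply(fun)
    .reset_index(name="value")
)

# check_dtype=False compares values and labels only, not exact dtypes.
pd.testing.assert_frame_equal(result, target, check_dtype=False)
```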

Answer 2

Score: 2

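Sort the rows so that True comes after False and, within each of those, higher values come after lower ones. The last row of every group is then the one you want, so use groupby.last: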

out = (df
   .sort_values(by=['relevant', 'value'])
   .groupby(['group0', 'group1'], as_index=False)
   ['value'].last()
 )

Output:

   group0 group1  value
0       1      1      0
1       2      1      2
2       2      2      3
3       3      2      5

Intermediate result after sorting, before aggregation (* marks the selected rows):

   group0 group1  relevant  value
1       1      1     False      1
2       2      1     False      2  *
0       1      1      True      0  *
3       2      2      True      3  *
4       3      2      True      4
5       3      2      True      5  *
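If the whole selected row is wanted rather than just "value", the same sort can be combined with drop_duplicates(keep="last") instead of groupby.last. This is a variant I am sketching here, not part of the answer above; it assumes the sample data from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group0": [1, 1, 2, 2, 3, 3],
        "group1": ["1", "1", "1", "2", "2", "2"],
        "relevant": [True, False, False, True, True, True],
        "value": [0, 1, 2, 3, 4, 5],
    }
)

keys = ["group0", "group1"]
out = (
    df.sort_values(by=["relevant", "value"])     # False before True, low before high
    .drop_duplicates(subset=keys, keep="last")   # keep the last row of each group
    .sort_values(by=keys)                        # restore group order
    .reset_index(drop=True)
)
print(out)
```

Because whole rows are kept, the "relevant" column shows which groups fell back to a non-relevant row.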

Answer 3

Score: 2

Probably not as elegant as other solutions, but also works:

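group_keys = ["group0", "group1"]
# Per-group maxima, computed separately for relevant and non-relevant rows
rev = df[df.relevant].groupby(group_keys).max()
nrev = df[~df.relevant].groupby(group_keys).max()
# The outer merge keeps every group; the _x columns come from rev, the _y columns from nrev
merged = rev.merge(nrev, on=group_keys, how='outer')
# Take the relevant maximum where the group has one, else the non-relevant maximum
merged['value'] = merged.value_x.where(merged.relevant_x, merged.value_y)
merged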
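The same idea can be written without the merge, as a rough sketch: mask the non-relevant values with NaN, take both maxima in one groupby, and fill the gaps. The column names relevant_value, relevant_max and overall_max are made up for illustration. Two assumptions to be aware of: "value" itself contains no NaN, and the result becomes float because of the masking step.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group0": [1, 1, 2, 2, 3, 3],
        "group1": ["1", "1", "1", "2", "2", "2"],
        "relevant": [True, False, False, True, True, True],
        "value": [0, 1, 2, 3, 4, 5],
    }
)

keys = ["group0", "group1"]

# Non-relevant rows become NaN, which the "max" aggregation skips.
masked = df.assign(relevant_value=df["value"].where(df["relevant"]))

agg = masked.groupby(keys, as_index=False).agg(
    relevant_max=("relevant_value", "max"),  # NaN if the group has no relevant row
    overall_max=("value", "max"),
)

# Fall back to the overall maximum where no relevant row exists.
agg["value"] = agg["relevant_max"].fillna(agg["overall_max"])
out = agg[keys + ["value"]]
print(out)
```

This avoids relying on how where treats the missing entries that the outer merge leaves in relevant_x, which I have not checked across pandas versions.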

huangapple
  • Posted on June 29, 2023, 16:17:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76579241.html