Reduce by multiple columns in pandas groupby

Question
Given the dataframe
import pandas as pd

df = pd.DataFrame(
    {
        "group0": [1, 1, 2, 2, 3, 3],
        "group1": ["1", "1", "1", "2", "2", "2"],
        "relevant": [True, False, False, True, True, True],
        "value": [0, 1, 2, 3, 4, 5],
    }
)
I wish to produce a target

target = pd.DataFrame(
    {
        "group0": [1, 2, 2, 3],
        "group1": ["1", "1", "2", "2"],
        "value": [0, 2, 3, 5],
    }
)
where "value"
has been chosen by
- Maximum of all positive
"relevant"
indices in"value"
column - Otherwise maximum of
"value"
if no positive"relevant"
indices exist
This would be produced by

def fun(x):
    tmp = x["value"][x["relevant"]]
    if len(tmp):
        return tmp.max()
    return x["value"].max()

where x is a group's sub-dataframe. Is it possible to achieve the desired groupby reduction efficiently?
EDIT:

with payload

import numpy as np
from time import perf_counter

df = pd.DataFrame(
    {
        "group0": np.random.randint(0, 30, size=10_000_000),
        "group1": np.random.randint(0, 30, size=10_000_000),
        # randint's upper bound is exclusive; 2 gives a random True/False mix
        "relevant": np.random.randint(0, 2, size=10_000_000).astype(bool),
        "value": np.random.random_sample(size=10_000_000) * 1000,
    }
)
start = perf_counter()
out = (
    df.sort_values(by=["relevant", "value"])
    .groupby(["group0", "group1"], as_index=False)["value"]
    .last()
)
end = perf_counter()
print("Sort values", end - start)
def fun(x):
    tmp = x["value"][x["relevant"]]
    if len(tmp):
        return tmp.max()
    return x["value"].max()
start = perf_counter()
out = df.groupby(["group0", "group1"]).apply(fun)
end = perf_counter()
print("Apply", end - start)
#Sort values 14.823943354000221
#Apply 1.5050544870009617
The .apply solution took 1.5 s, while the sort_values solution took 14.82 s. However, shrinking the test groups (i.e. increasing the number of distinct groups) with
...
"group0": np.random.randint(0, 500_000,size=10_000_000),
"group1": np.random.randint(0, 100_000,size=10_000_000),
...
led to vastly superior performance from the sort_values solution (15.29 s versus 1423.84 s). The sort_values solution by @mozway is preferable, unless the user knows in advance that the data contains only a small number of groups.
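For completeness, the same reduction can also be written without sorting or apply, by masking "value" with "relevant" and falling back to the plain per-group maximum. This is only a sketch (not benchmarked here); note the masking introduces NaN, so "value" comes back as float:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group0": [1, 1, 2, 2, 3, 3],
        "group1": ["1", "1", "1", "2", "2", "2"],
        "relevant": [True, False, False, True, True, True],
        "value": [0, 1, 2, 3, 4, 5],
    }
)

keys = ["group0", "group1"]
out = (
    # "value" where "relevant" is True, NaN elsewhere
    df.assign(masked=df["value"].where(df["relevant"]))
    .groupby(keys, as_index=False)
    .agg(masked=("masked", "max"), fallback=("value", "max"))
)
# relevant maximum where one exists, overall maximum otherwise
out["value"] = out["masked"].fillna(out["fallback"])
out = out[keys + ["value"]]
```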
Answer 1 (score: 2)
Yes, it is possible to achieve the desired groupby
reduction efficiently. Here is the modified code:
def fun(x):
    tmp = x["value"][x["relevant"]]
    if len(tmp):
        return tmp.max()
    return x["value"].max()
You can get the desired output without modifying your function by applying it through groupby.apply:
target = df.groupby(["group0", "group1"]).apply(fun).reset_index(name='value')
print(target)
Here is the desired output:

   group0 group1  value
0       1      1      0
1       2      1      2
2       2      2      3
3       3      2      5
Answer 2 (score: 2)
Sort the values so that True comes last, with the highest "value" last within each group, then use groupby.last:
out = (
    df.sort_values(by=["relevant", "value"])
    .groupby(["group0", "group1"], as_index=False)["value"]
    .last()
)
Output:

   group0 group1  value
0       1      1      0
1       2      1      2
2       2      2      3
3       3      2      5
Intermediate before aggregation (* marks the selected rows):

   group0 group1  relevant  value
1       1      1     False      1
2       2      1     False      2  *
0       1      1      True      0  *
3       2      2      True      3  *
4       3      2      True      4
5       3      2      True      5  *
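Put together on the example frame, the snippet is self-contained (a sketch of the steps above, nothing new):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group0": [1, 1, 2, 2, 3, 3],
        "group1": ["1", "1", "1", "2", "2", "2"],
        "relevant": [True, False, False, True, True, True],
        "value": [0, 1, 2, 3, 4, 5],
    }
)

# False sorts before True, so within each group the last row holds the
# highest "relevant" value if any exists, else the highest value overall
out = (
    df.sort_values(by=["relevant", "value"])
    .groupby(["group0", "group1"], as_index=False)["value"]
    .last()
)
```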
Answer 3 (score: 2)
Probably not as elegant as other solutions, but also works:
group_keys = ["group0", "group1"]
# per-group maxima over the relevant and non-relevant rows separately
rev = df[df.relevant].groupby(group_keys).max()
nrev = df[~df.relevant].groupby(group_keys).max()
# outer-merge so groups with no relevant rows are kept (their rev columns are NaN)
merged = rev.merge(nrev, on=group_keys, how="outer")
# prefer the relevant maximum; fall back to the non-relevant one otherwise
merged["value"] = merged.value_x.where(merged.relevant_x, merged.value_y)
merged
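A variant of the same idea that keeps the group keys as columns (as_index=False) and fills the missing relevant_x entries with False explicitly; these tweaks are illustrative additions, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group0": [1, 1, 2, 2, 3, 3],
        "group1": ["1", "1", "1", "2", "2", "2"],
        "relevant": [True, False, False, True, True, True],
        "value": [0, 1, 2, 3, 4, 5],
    }
)

group_keys = ["group0", "group1"]
# per-group maxima over relevant and non-relevant rows, keys kept as columns
rev = df[df.relevant].groupby(group_keys, as_index=False).max()
nrev = df[~df.relevant].groupby(group_keys, as_index=False).max()
merged = rev.merge(nrev, on=group_keys, how="outer")
# groups absent from rev get NaN in relevant_x; make the False fallback explicit
cond = merged["relevant_x"].fillna(False).astype(bool)
merged["value"] = merged["value_x"].where(cond, merged["value_y"])
out = merged[group_keys + ["value"]]
```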