Pandas DataFrame aggregation with a condition
I want to aggregate a DataFrame based on a condition. I can do it, but not for every scenario. Here is an example:
import pandas as pd
import numpy as np
th=0.5
tuples = list(zip(*[
["foo", "foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar", "bar", "bar"],
["A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"],
["one", "two", "three", "one", "two", "three", "one", "two", "three", "one", "two", "three"],
]))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second", "third"])
# create test DataFrame and a simpler copy
df1 = pd.DataFrame(np.array([0.04325033, 0.17730071, 0.05833607, 0.15995565, 0.48858341,
0.27516962, 0.26154782, 0.68857624, 0.75173469, 0.34701632,
0.10108433, 0.32280973]), index=index, columns=["data"])
df2 = df1.reset_index()[["first", "second", "data"]].groupby(["first", "second"]).sum().copy()
# if the amount is greater or equal to the threshold copy the value
df1.loc[df1['data'] >= th, 'test'] = df1.loc[df1['data'] >= th, 'data']
df2.loc[df2['data'] >= th, 'test'] = df2.loc[df2['data'] >= th, 'data']
# if the amount is smaller than the threshold aggregate all such values for the group
df1.loc[df1['data'] < th, 'test'] = df1.loc[df1['data'] < th, 'data'].groupby(["first", "second"]).sum()
df2.loc[df2['data'] < th, 'test'] = df2.loc[df2['data'] < th, 'data'].groupby(["first"]).sum()
It gives the following output:
data test
first second third
foo A one 0.043250 0.278887
two 0.177301 0.278887
three 0.058336 0.278887
B one 0.159956 0.923709
two 0.488583 0.923709
three 0.275170 0.923709
bar A one 0.261548 0.261548
two 0.688576 0.688576
three 0.751735 0.751735
B one 0.347016 0.770910
two 0.101084 0.770910
three 0.322810 0.770910
data test
first second
bar A 1.701859 1.701859
B 0.770910 0.770910
foo A 0.278887 NaN
B 0.923709 0.923709
The first DataFrame is as intended, but in the second one there is np.nan where the aggregation should be. Why does the first work while the second does not? How can I achieve this?
Bonus: I would like to have a DataFrame with n aggregation columns and apply this procedure to each of them from the innermost to the outermost level, keeping values over the threshold and aggregating all those below it. Is there a better solution than a for loop?
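For the bonus, one reading of "apply the procedure from the innermost to the outermost level" is to rebuild the aggregated frame at each grouping depth and compute its test column there. A hedged sketch under that assumption (the helper name threshold_tests is made up, not from the original post):

```python
import pandas as pd

th = 0.5
tuples = list(zip(
    ["foo"] * 6 + ["bar"] * 6,
    ["A", "A", "A", "B", "B", "B"] * 2,
    ["one", "two", "three"] * 4,
))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second", "third"])
df1 = pd.DataFrame(
    {"data": [0.04325033, 0.17730071, 0.05833607, 0.15995565, 0.48858341,
              0.27516962, 0.26154782, 0.68857624, 0.75173469, 0.34701632,
              0.10108433, 0.32280973]},
    index=index,
)

def threshold_tests(df, th, levels):
    """Hypothetical helper: for each grouping depth, innermost first,
    build the aggregated frame and its 'test' column as in df1/df2."""
    results = []
    for depth in range(len(levels), 0, -1):
        keys = levels[:depth]
        # avoid the single-element-list FutureWarning mentioned below
        agg = df.groupby(level=keys if depth > 1 else keys[0])["data"].sum().to_frame()
        mask = agg["data"] >= th
        agg.loc[mask, "test"] = agg.loc[mask, "data"]
        if depth > 1:
            # transform('sum') keeps the index, so the assignment aligns
            agg.loc[~mask, "test"] = (
                agg.loc[~mask, "data"]
                .groupby(level=keys[:-1] if depth > 2 else keys[0])
                .transform("sum")
            )
        else:
            # the outermost level has no parent group: keep the value
            agg.loc[~mask, "test"] = agg.loc[~mask, "data"]
        results.append(agg)
    return results

results = threshold_tests(df1, th, ["first", "second", "third"])
print(results[1])
```

This reproduces the expected df1 and df2 outputs above for the sample data, but it still loops over the index depths; since each depth works on a differently-shaped frame, a fully vectorized alternative is not obvious.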
Answer 1 (score: 2)
The problem seems to be:
df2.loc[df2['data'] < th, 'data'].groupby(["first"]).sum()
First, don't pass a single-element list to groupby; it raises a FutureWarning:
> FutureWarning: In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.
Next, if you group by one level, you have to broadcast the aggregated result to all filtered rows to keep the same dimensions (the MultiIndex):
>>> df2.loc[df2['data'] < th, 'data'].groupby('first').sum()
first
foo 0.278887
Name: data, dtype: float64
>>> df2.loc[df2['data'] < th, 'data'].groupby('first').transform('sum')
first second
foo A 0.278887
Name: data, dtype: float64
Now the levels are the same so you can set the values:
df2.loc[df2['data'] < th, 'test'] = \
df2.loc[df2['data'] < th, 'data'].groupby('first').transform('sum')
print(df2)
# Output
data test
first second
bar A 1.701859 1.701859
B 0.770910 0.770910
foo A 0.278887 0.278887 # <- HERE
B 0.923709 0.923709
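To verify the fix end to end, here is a minimal self-contained sketch using the same aggregated numbers as df2 above:

```python
import pandas as pd

th = 0.5
index = pd.MultiIndex.from_tuples(
    [("bar", "A"), ("bar", "B"), ("foo", "A"), ("foo", "B")],
    names=["first", "second"],
)
df2 = pd.DataFrame({"data": [1.701859, 0.770910, 0.278887, 0.923709]}, index=index)

mask = df2["data"] >= th
df2.loc[mask, "test"] = df2.loc[mask, "data"]
# transform('sum') keeps the (first, second) index, so the assignment
# aligns row by row instead of producing NaN
df2.loc[~mask, "test"] = df2.loc[~mask, "data"].groupby("first").transform("sum")
print(df2)
```

The key difference from the original code is transform('sum'): unlike sum(), it returns a Series indexed exactly like its input, so .loc assignment can align every filtered row.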