Pandas DataFrame aggregation with a condition

Question

I want to aggregate a DataFrame based on a condition. I can do it, but not in every scenario. Here is an example:

import pandas as pd
import numpy as np

th = 0.5
tuples = list(zip(*[
            ["foo", "foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar", "bar", "bar"],
            ["A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"],
            ["one", "two", "three", "one", "two", "three", "one", "two", "three", "one", "two", "three"],
        ]))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second", "third"])

# create test DataFrame and a simpler copy
df1 = pd.DataFrame(np.array([0.04325033, 0.17730071, 0.05833607, 0.15995565, 0.48858341,
       0.27516962, 0.26154782, 0.68857624, 0.75173469, 0.34701632,
       0.10108433, 0.32280973]), index=index, columns=["data"])
df2 = df1.reset_index()[["first", "second", "data"]].groupby(["first", "second"]).sum().copy()

# if the amount is greater than or equal to the threshold, copy the value
df1.loc[df1['data'] >= th, 'test'] = df1.loc[df1['data'] >= th, 'data']
df2.loc[df2['data'] >= th, 'test'] = df2.loc[df2['data'] >= th, 'data']

# if the amount is smaller than the threshold, aggregate all such values for the group
df1.loc[df1['data'] < th, 'test'] = df1.loc[df1['data'] < th, 'data'].groupby(["first", "second"]).sum()
df2.loc[df2['data'] < th, 'test'] = df2.loc[df2['data'] < th, 'data'].groupby(["first"]).sum()

It gives the following output:

                        data      test
first second third                    
foo   A      one    0.043250  0.278887
             two    0.177301  0.278887
             three  0.058336  0.278887
      B      one    0.159956  0.923709
             two    0.488583  0.923709
             three  0.275170  0.923709
bar   A      one    0.261548  0.261548
             two    0.688576  0.688576
             three  0.751735  0.751735
      B      one    0.347016  0.770910
             two    0.101084  0.770910
             three  0.322810  0.770910

                  data      test
first second                    
bar   A       1.701859  1.701859
      B       0.770910  0.770910
foo   A       0.278887       NaN
      B       0.923709  0.923709

The first DataFrame is as intended, but the second one has np.nan where the aggregated value should be. Why does the first work and the second not? How can I achieve this?

Bonus:
I would like to have a DataFrame with n aggregation columns and apply this procedure to each of them, from the innermost to the outermost, keeping a value if it is at or above the threshold and aggregating all values below it. Is there a better solution than a for loop?

Answer 1

Score: 2

The problem seems to be this expression:

df2.loc[df2['data'] < th, 'data'].groupby(["first"]).sum()

First, don't pass a one-element list to groupby; it can raise a FutureWarning:

> FutureWarning: In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.

Next, if you group by only one level, sum() returns a Series indexed by that level alone ('first'). That index no longer matches the two-level MultiIndex of the filtered rows, so the assignment has nothing to align with and leaves NaN. To keep the same index, broadcast the aggregate back to every filtered row with transform('sum'):

>>> df2.loc[df2['data'] < th, 'data'].groupby('first').sum()
first
foo    0.278887
Name: data, dtype: float64

>>> df2.loc[df2['data'] < th, 'data'].groupby('first').transform('sum')
first  second
foo    A         0.278887
Name: data, dtype: float64
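
To make the difference between the two calls concrete, here is a minimal sketch on toy data. The Series `s` and its values are made up for illustration and are not taken from the question; the point is only that sum() collapses the index to one level while transform('sum') keeps the original MultiIndex.

```python
import pandas as pd

# Hypothetical two-level index, mirroring the shape of df2 in the question
idx = pd.MultiIndex.from_tuples(
    [("foo", "A"), ("foo", "B"), ("bar", "A")], names=["first", "second"]
)
s = pd.Series([1.0, 2.0, 4.0], index=idx, name="data")

# One row per group, indexed by 'first' only
agg = s.groupby("first").sum()

# One row per input row, original MultiIndex preserved
bro = s.groupby("first").transform("sum")

print(agg.index.nlevels, bro.index.nlevels)
print(bro.index.equals(s.index))
```

Because `bro` carries the same index as the rows it was computed from, it can be assigned back with .loc without any realignment surprises.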

Now the index levels match, so you can assign the values:

df2.loc[df2['data'] < th, 'test'] = \
    df2.loc[df2['data'] < th, 'data'].groupby('first').transform('sum')
print(df2)

# Output
                  data      test
first second                    
bar   A       1.701859  1.701859
      B       0.770910  0.770910
foo   A       0.278887  0.278887  # <- HERE
      B       0.923709  0.923709
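
Putting the pieces together, a self-contained version of the fix might look like the sketch below. It assumes the data from the question. The name `below` is a helper introduced here, and three shortcuts differ from the original code: the index is built with from_product, df2 is built with groupby(level=...), and the test column starts as a copy of data before the below-threshold rows are overwritten.

```python
import pandas as pd

th = 0.5
index = pd.MultiIndex.from_product(
    [["foo", "bar"], ["A", "B"], ["one", "two", "three"]],
    names=["first", "second", "third"],
)
values = [0.04325033, 0.17730071, 0.05833607, 0.15995565, 0.48858341, 0.27516962,
          0.26154782, 0.68857624, 0.75173469, 0.34701632, 0.10108433, 0.32280973]
df1 = pd.DataFrame({"data": values}, index=index)

# Sum over the innermost level to get the two-level frame from the question
df2 = df1.groupby(level=["first", "second"]).sum()

# Boolean mask of rows whose value is below the threshold
below = df2["data"] < th

# Start from a copy of the values; rows at or above the threshold keep them
df2["test"] = df2["data"]

# Overwrite below-threshold rows with the per-'first' sum of below-threshold values;
# transform keeps the (first, second) index, so the assignment aligns row by row
df2.loc[below, "test"] = df2.loc[below, "data"].groupby(level="first").transform("sum")

print(df2)
```

Computing the mask once also avoids repeating the comparison on both sides of the assignment.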

huangapple
  • Posted on March 7, 2023, 02:05:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75654322.html