Pandas DataFrame aggregation with a condition
I want to aggregate a DataFrame based on a condition. I can do it, but not for every scenario. Here is an example:
import pandas as pd
import numpy as np
th=0.5
tuples = list(zip(*[
["foo", "foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar", "bar", "bar"],
["A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"],
["one", "two", "three", "one", "two", "three", "one", "two", "three", "one", "two", "three"],
]))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second", "third"])
# create test DataFrame and a simpler copy
df1 = pd.DataFrame(np.array([0.04325033, 0.17730071, 0.05833607, 0.15995565, 0.48858341,
0.27516962, 0.26154782, 0.68857624, 0.75173469, 0.34701632,
0.10108433, 0.32280973]), index=index, columns=["data"])
df2 = df1.reset_index()[["first", "second", "data"]].groupby(["first", "second"]).sum().copy()
# if the amount is greater or equal to the threshold copy the value
df1.loc[df1['data'] >= th, 'test'] = df1.loc[df1['data'] >= th, 'data']
df2.loc[df2['data'] >= th, 'test'] = df2.loc[df2['data'] >= th, 'data']
# if the amount is smaller than the threshold aggregate all such values for the group
df1.loc[df1['data'] < th, 'test'] = df1.loc[df1['data'] < th, 'data'].groupby(["first", "second"]).sum()
df2.loc[df2['data'] < th, 'test'] = df2.loc[df2['data'] < th, 'data'].groupby(["first"]).sum()
It gives the following output:
data test
first second third
foo A one 0.043250 0.278887
two 0.177301 0.278887
three 0.058336 0.278887
B one 0.159956 0.923709
two 0.488583 0.923709
three 0.275170 0.923709
bar A one 0.261548 0.261548
two 0.688576 0.688576
three 0.751735 0.751735
B one 0.347016 0.770910
two 0.101084 0.770910
three 0.322810 0.770910
data test
first second
bar A 1.701859 1.701859
B 0.770910 0.770910
foo A 0.278887 NaN
B 0.923709 0.923709
The first DataFrame is as intended, but in the second one there is np.nan where the aggregation should be. Why does the first work while the second does not? How can I achieve this?
Bonus: I would like to have a DataFrame with n aggregation columns and apply this procedure to each of them from the innermost to the outermost level, keeping values over the threshold and aggregating all those below it. Is there a better solution than a for loop?
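For the bonus, one reading of "apply the procedure from the innermost to the outermost level" is to rebuild the aggregated frame at each grouping depth and compute its test column there. A hedged sketch under that assumption (the helper name threshold_tests is made up, not from the original post):

```python
import pandas as pd

th = 0.5
tuples = list(zip(
    ["foo"] * 6 + ["bar"] * 6,
    ["A", "A", "A", "B", "B", "B"] * 2,
    ["one", "two", "three"] * 4,
))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second", "third"])
df1 = pd.DataFrame(
    {"data": [0.04325033, 0.17730071, 0.05833607, 0.15995565, 0.48858341,
              0.27516962, 0.26154782, 0.68857624, 0.75173469, 0.34701632,
              0.10108433, 0.32280973]},
    index=index,
)

def threshold_tests(df, th, levels):
    """Hypothetical helper: for each grouping depth, innermost first,
    build the aggregated frame and its 'test' column as in df1/df2."""
    results = []
    for depth in range(len(levels), 0, -1):
        keys = levels[:depth]
        # avoid the single-element-list FutureWarning mentioned below
        agg = df.groupby(level=keys if depth > 1 else keys[0])["data"].sum().to_frame()
        mask = agg["data"] >= th
        agg.loc[mask, "test"] = agg.loc[mask, "data"]
        if depth > 1:
            # transform('sum') keeps the index, so the assignment aligns
            agg.loc[~mask, "test"] = (
                agg.loc[~mask, "data"]
                .groupby(level=keys[:-1] if depth > 2 else keys[0])
                .transform("sum")
            )
        else:
            # the outermost level has no parent group: keep the value
            agg.loc[~mask, "test"] = agg.loc[~mask, "data"]
        results.append(agg)
    return results

results = threshold_tests(df1, th, ["first", "second", "third"])
print(results[1])
```

This reproduces the expected df1 and df2 outputs above for the sample data, but it still loops over the index depths; since each depth works on a differently-shaped frame, a fully vectorized alternative is not obvious.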
Answer 1 (score: 2)
The problem seems to be:
df2.loc[df2['data'] < th, 'data'].groupby(["first"]).sum()
First, don't pass a single-element list to groupby; it raises a FutureWarning:
> FutureWarning: In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.
Next, if you group by one level, you have to broadcast the aggregated result to all filtered rows to keep the same dimensions (the MultiIndex):
>>> df2.loc[df2['data'] < th, 'data'].groupby('first').sum()
first
foo 0.278887
Name: data, dtype: float64
>>> df2.loc[df2['data'] < th, 'data'].groupby('first').transform('sum')
first second
foo A 0.278887
Name: data, dtype: float64
Now the levels are the same so you can set the values:
df2.loc[df2['data'] < th, 'test'] = \
df2.loc[df2['data'] < th, 'data'].groupby('first').transform('sum')
print(df2)
# Output
data test
first second
bar A 1.701859 1.701859
B 0.770910 0.770910
foo A 0.278887 0.278887 # <- HERE
B 0.923709 0.923709
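To verify the fix end to end, here is a minimal self-contained sketch using the same aggregated numbers as df2 above:

```python
import pandas as pd

th = 0.5
index = pd.MultiIndex.from_tuples(
    [("bar", "A"), ("bar", "B"), ("foo", "A"), ("foo", "B")],
    names=["first", "second"],
)
df2 = pd.DataFrame({"data": [1.701859, 0.770910, 0.278887, 0.923709]}, index=index)

mask = df2["data"] >= th
df2.loc[mask, "test"] = df2.loc[mask, "data"]
# transform('sum') keeps the (first, second) index, so the assignment
# aligns row by row instead of producing NaN
df2.loc[~mask, "test"] = df2.loc[~mask, "data"].groupby("first").transform("sum")
print(df2)
```

The key difference from the original code is transform('sum'): unlike sum(), it returns a Series indexed exactly like its input, so .loc assignment can align every filtered row.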