January 6, 2020, 20:16:22

Pandas groupby throws an error when using sum()

Question

I am trying to calculate the between cluster scatter matrix. In order to do that, for each cluster (named "group" in the example below), I need to perform an operation which results in a matrix and subsequently perform an element-wise addition of the matrices from each cluster.

To do this I try the following:

import pandas as pd
import numpy as np
df = pd.DataFrame({'group': [1, 2, 1, 0, 0, 0, 1, 2],
                   'A': [1.5, 0.5, 2.5, 0.5, 1.5, 0.5, 1.5, 0.5],
                   'B': [3.5, 2.5, 3.5, 2.5, 3.5, 2.5, 3.5, 2.5]})
features = list(df.columns)
features.remove('group')
def g(x, mu):
    y = np.array([np.mean(x) - mu])
    print((y.T @ y)*len(x))
    print("")
    return (y.T @ y)*len(x)
m = len(df.index)
mu = df.groupby('group')[features].apply(lambda x: (np.multiply(x.count(), np.mean(x)))/m).sum()
print("mu:")
print(mu)
Sb = df.groupby('group')[features].apply(g, mu=(mu)).sum()

This example throws a TypeError: Series.name must be a hashable type error on the last line. The print statement in the g function shows the result as expected, see below, so I believe the error is due to the .sum() operation.

[[0.2553 0.1458]
 [0.1458 0.0833]]
[[1.5052 1.0625]
 [1.0625 0.75  ]]
[[0.7912 0.625 ]
 [0.625 0.5    ]]

The result I was expecting by adding the .sum() operation was the element-wise addition of the three matrices above.

The expected output is:

[[2.5416 1.8333]
 [1.8333 1.3333]]

Any ideas why this is giving me an error and what I can do to correct it?

Update 1:
Using:

Sb = df.groupby('group').apply(g, mu=(mu)).sum()

instead of

Sb = df.groupby('group')[features].apply(g, mu=(mu)).sum()

gives the correct matrix, padded with NaNs. Why does features cause an error?

Answer 1

Score: 3

Have you tried this?

sb = df.groupby('group').apply(g, mu=(mu)).sum()

It gives the following result:

[[2.54166667 1.83333333        nan]
 [1.83333333 1.33333333        nan]
 [       nan        nan        nan]]

Is that what you want?

You still have to deal with the NaNs, though.

Edit to answer your comments:

To address the problem you raised in the comments, you could change your function as follows, so that it selects the feature columns itself:

def g(x, mu):
    x = x[["A", "B"]]  # or x = x[features]
    y = np.array([np.mean(x) - mu])
    print((y.T @ y)*(len(x)))
    print("")
    return (y.T @ y)*(len(x))

and then:

sb = df.groupby(['group']).apply(g, mu=(mu)).sum()
print(sb)

which gives:

[[2.54166667 1.83333333]
 [1.83333333 1.33333333]]
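
As a possible alternative, here is a minimal sketch that avoids calling .sum() on the array-valued results of apply. It loops over the groups and accumulates the outer products directly. The helper name between_cluster_scatter is hypothetical and not part of the original posts. The sketch also assumes there are no missing values, so the overall column mean can stand in for the count-weighted sum of group means used in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': [1, 2, 1, 0, 0, 0, 1, 2],
                   'A': [1.5, 0.5, 2.5, 0.5, 1.5, 0.5, 1.5, 0.5],
                   'B': [3.5, 2.5, 3.5, 2.5, 3.5, 2.5, 3.5, 2.5]})
features = [c for c in df.columns if c != 'group']

# Overall mean of each feature. With no missing values this equals the
# count-weighted average of the per-group means (an assumption of this sketch).
mu = df[features].mean().to_numpy()


def between_cluster_scatter(frame, group_col, feature_cols, overall_mean):
    """Hypothetical helper: sum n_k * (mean_k - mu)(mean_k - mu)^T over groups."""
    k = len(feature_cols)
    sb = np.zeros((k, k))
    for _, block in frame.groupby(group_col):
        # Difference between this group's feature means and the overall mean.
        diff = block[feature_cols].mean().to_numpy() - overall_mean
        # Weight the outer product by the number of rows in the group.
        sb += len(block) * np.outer(diff, diff)
    return sb


Sb = between_cluster_scatter(df, 'group', features, mu)
print(Sb)
```

Because the accumulation happens in a plain NumPy array, the result has one row and one column per feature and no NaN padding to clean up. For the example data it should match the expected 2x2 matrix from the question.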
