Pandas groupby throws an error when using sum()


Question


I am trying to calculate the between-cluster scatter matrix. For each cluster (named "group" in the example below), I need to perform an operation that produces a matrix, and then add the matrices from all clusters element-wise.

To do this, I tried the following:

import pandas as pd
import numpy as np

df = pd.DataFrame({'group': [1, 2, 1, 0, 0, 0, 1, 2],
                   'A': [1.5, 0.5, 2.5, 0.5, 1.5, 0.5, 1.5, 0.5],
                   'B': [3.5, 2.5, 3.5, 2.5, 3.5, 2.5, 3.5, 2.5]})
features = list(df.columns)
features.remove('group')


def g(x, mu):
    y = np.array([np.mean(x) - mu])
    print((y.T @ y)*len(x))
    print("")
    return (y.T @ y)*len(x)


m = len(df.index)
mu = df.groupby('group')[features].apply(lambda x: (np.multiply(x.count(), np.mean(x)))/m).sum()
print("mu:")
print(mu)

Sb = df.groupby('group')[features].apply(g, mu=(mu)).sum()

This example raises "TypeError: Series.name must be a hashable type" on the last line. The print statements in g show the expected per-cluster results (see below), so I believe the error comes from the .sum() operation.

[[0.2552 0.1458]
 [0.1458 0.0833]]

[[1.5052 1.0625]
 [1.0625 0.75  ]]

[[0.7812 0.625 ]
 [0.625  0.5   ]]

By adding .sum(), I expected to get the element-wise sum of the three matrices above.

The expected output is:

[[2.5416 1.8333]
 [1.8333 1.3333]]
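
For reference, the computation described in the question corresponds to the usual definition of the between-cluster scatter matrix. The symbols below (S_B, n_k, mu_k, mu, m) are notation introduced here for illustration and do not appear in the original post: n_k is the number of rows in cluster k, mu_k is that cluster's vector of feature means, mu is the overall feature mean, and m is the total number of rows. Vectors are treated as columns.

```latex
S_B = \sum_{k} n_k \, (\mu_k - \mu)(\mu_k - \mu)^{\top},
\qquad
\mu = \frac{1}{m} \sum_{k} n_k \, \mu_k

% Worked check for group 1 (n_1 = 3, \mu_1 = (1.8333, 3.5), \mu = (1.125, 3.0)):
3 \begin{pmatrix} 0.7083 \\ 0.5 \end{pmatrix}
  \begin{pmatrix} 0.7083 & 0.5 \end{pmatrix}
= \begin{pmatrix} 1.5052 & 1.0625 \\ 1.0625 & 0.75 \end{pmatrix}
```

The worked check reproduces the second matrix printed by g, which suggests that g and mu in the code implement this definition term by term.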

Any ideas why this code raises the error, and what I can do to fix it?

Update 1:
Using:

Sb = df.groupby('group').apply(g, mu=(mu)).sum()

instead of

Sb = df.groupby('group')[features].apply(g, mu=(mu)).sum()

gives the correct matrix, padded with NaNs. Why does selecting [features] cause an error?

Answer 1

Score: 3


Have you tried this?

sb = df.groupby('group').apply(g, mu=mu).sum()

It gives the following result:

[[2.54166667 1.83333333        nan]
 [1.83333333 1.33333333        nan]
 [       nan        nan        nan]]

Is this what you want?

You still have to deal with the NaNs, though.
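
A likely explanation for the NaN padding, offered as a sketch rather than a confirmed diagnosis: without the [features] selection, g apparently receives the 'group' column along with A and B (the 3 x 3 result suggests this for the pandas version used here; newer versions may behave differently). Subtracting mu, which only has entries for A and B, then aligns on labels and leaves NaN for 'group'. The Series values below are stand-ins taken from group 1 of the example, not output captured from the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one cluster's column means when the 'group'
# column has not been dropped (values taken from group 1 of the example).
cluster_mean = pd.Series({'group': 1.0, 'A': 11 / 6, 'B': 3.5})

# Overall feature means, which only cover the feature columns.
mu = pd.Series({'A': 1.125, 'B': 3.0})

# Series arithmetic aligns on labels: 'group' has no partner in mu,
# so the difference for that label is NaN.
diff = cluster_mean - mu
print(diff)

# The outer product carries that single NaN into a full row and column.
y = np.array([diff.to_numpy()])
outer = y.T @ y
print(outer)
```

Restricting x to the feature columns before taking the mean removes the unmatched label, so the extra row and column disappear.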

Edit to answer your comments:

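To address the problem you raised in the comments, you could change your function as follows:

def g(x, mu):
    x = x[["A", "B"]]  # keep only the feature columns (or: x = x[features])
    y = np.array([np.mean(x) - mu])  # cluster mean minus overall mean, as a row vector
    print((y.T @ y) * len(x))
    print("")
    return (y.T @ y) * len(x)  # outer product weighted by cluster size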

and then:

sb = df.groupby(['group']).apply(g, mu=mu).sum()
print(sb)

which gives:

[[2.54166667 1.83333333]
 [1.83333333 1.33333333]]
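
As an alternative, here is a minimal sketch that accumulates the matrices in an explicit loop instead of calling .sum() on the result of apply. The function name between_cluster_scatter and its parameters are made up for illustration and are not part of the original answer. It uses x.mean() rather than np.mean(x), on the assumption that the per-column mean is what is wanted; np.mean on a DataFrame may reduce over both axes in newer pandas releases.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': [1, 2, 1, 0, 0, 0, 1, 2],
                   'A': [1.5, 0.5, 2.5, 0.5, 1.5, 0.5, 1.5, 0.5],
                   'B': [3.5, 2.5, 3.5, 2.5, 3.5, 2.5, 3.5, 2.5]})
features = ['A', 'B']


def between_cluster_scatter(frame, group_col, feature_cols):
    """Sum n_k * (mu_k - mu)(mu_k - mu)^T over all clusters (illustrative helper)."""
    # Overall per-column mean; equal to the size-weighted mean of the cluster means.
    overall_mean = frame[feature_cols].mean().to_numpy()
    d = len(feature_cols)
    scatter = np.zeros((d, d))
    for _, cluster in frame.groupby(group_col):
        # Deviation of this cluster's mean from the overall mean.
        diff = cluster[feature_cols].mean().to_numpy() - overall_mean
        # Outer product weighted by the number of rows in the cluster.
        scatter += len(cluster) * np.outer(diff, diff)
    return scatter


Sb = between_cluster_scatter(df, 'group', features)
print(Sb)
```

Because the result is a plain NumPy array built only from the feature columns, there is no label alignment involved and no NaN padding to clean up afterwards.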

huangapple
  • Posted on January 6, 2020, 20:16:22
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59611960.html