Pandas groupby throws an error when using sum()
Question
I am trying to calculate the between-cluster scatter matrix. To do that, for each cluster (named "group" in the example below), I need to perform an operation that produces a matrix, and then add the matrices from all clusters element-wise.
To do this, I tried the following:
import pandas as pd
import numpy as np

df = pd.DataFrame({'group': [1, 2, 1, 0, 0, 0, 1, 2],
                   'A': [1.5, 0.5, 2.5, 0.5, 1.5, 0.5, 1.5, 0.5],
                   'B': [3.5, 2.5, 3.5, 2.5, 3.5, 2.5, 3.5, 2.5]})
features = list(df.columns)
features.remove('group')

def g(x, mu):
    y = np.array([np.mean(x) - mu])
    print((y.T @ y)*len(x))
    print("")
    return (y.T @ y)*len(x)

m = len(df.index)
mu = df.groupby('group')[features].apply(lambda x: (np.multiply(x.count(), np.mean(x)))/m).sum()
print("mu:")
print(mu)
Sb = df.groupby('group')[features].apply(g, mu=(mu)).sum()
This example throws "TypeError: Series.name must be a hashable type" on the last line. The print statement in the g function shows the expected result (see below), so I believe the error is due to the .sum() operation.
[[0.2552 0.1458]
 [0.1458 0.0833]]

[[1.5052 1.0625]
 [1.0625 0.75  ]]

[[0.7812 0.625 ]
 [0.625  0.5   ]]
By adding the .sum() operation, I expected to get the element-wise sum of the three matrices above.
The expected output is:
[[2.5416 1.8333]
 [1.8333 1.3333]]
Any ideas why this is giving me an error and what I can do to correct it?
Update 1:
Using:
Sb = df.groupby('group').apply(g, mu=(mu)).sum()
instead of
Sb = df.groupby('group')[features].apply(g, mu=(mu)).sum()
gives the correct matrix, padded with NaNs. Why does selecting [features] cause an error?
Answer 1
Score: 3
Have you tried this?
sb = df.groupby('group').apply(g, mu=(mu)).sum()
It gives the following result:
[[2.54166667 1.83333333        nan]
 [1.83333333 1.33333333        nan]
 [       nan        nan        nan]]
Is this what you want? You would still have to deal with the NaNs, though.
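A likely reason for the NaN padding, assuming that without the [features] selection the group column reaches g alongside A and B (the 3x3 shape suggests this, though it may depend on the pandas version): pandas aligns Series on their labels when subtracting, so a label missing from mu yields NaN. The values below are illustrative stand-ins for one group's means, not output from the original code:

```python
import numpy as np
import pandas as pd

# Hypothetical per-group means that still include the grouping column.
group_mean = pd.Series({"group": 0.0, "A": 5 / 6, "B": 17 / 6})
# Overall feature means, indexed by feature name only.
mu = pd.Series({"A": 1.125, "B": 3.0})

# Subtraction aligns on labels; "group" has no partner in mu, so it becomes NaN.
diff = group_mean - mu
print(diff)

# The outer product then has a NaN row and a NaN column.
y = np.array([diff.to_numpy()])
outer = y.T @ y
print(outer.shape)
```

Restricting x to the feature columns before taking the mean avoids the extra label.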
Edit, in response to your comments:
To address the problem you raised in the comments, you could change your function so that it only uses the feature columns:
def g(x, mu):
    x = x[["A", "B"]]  # or x = x[features]
    y = np.array([np.mean(x) - mu])
    print((y.T @ y)*(len(x)))
    print("")
    return (y.T @ y)*(len(x))
and then:
sb = df.groupby(['group']).apply(g, mu=(mu)).sum()
print(sb)
which gives:
[[2.54166667 1.83333333]
 [1.83333333 1.33333333]]
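As an alternative sketch (not the approach used above), the same matrix can be accumulated with an explicit loop over the groups, which avoids asking apply() to combine 2-D arrays. It assumes mu is the overall column mean, which equals the size-weighted average of the group means computed in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': [1, 2, 1, 0, 0, 0, 1, 2],
                   'A': [1.5, 0.5, 2.5, 0.5, 1.5, 0.5, 1.5, 0.5],
                   'B': [3.5, 2.5, 3.5, 2.5, 3.5, 2.5, 3.5, 2.5]})
features = ['A', 'B']

# Overall mean of each feature as a plain NumPy vector.
mu = df[features].mean().to_numpy()

# Accumulate n_k * (mu_k - mu)(mu_k - mu)^T over the groups.
Sb = np.zeros((len(features), len(features)))
for _, grp in df.groupby('group'):
    d = grp[features].mean().to_numpy() - mu
    Sb += len(grp) * np.outer(d, d)

print(Sb)
```

Working with plain NumPy arrays inside the loop sidesteps label alignment entirely, so no NaN padding can appear.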