Pandas groupby throws an error when using sum()

Question

I am trying to calculate the between-cluster scatter matrix. To do that, for each cluster (named "group" in the example below), I need to perform an operation that produces a matrix, and then add the matrices from all clusters element-wise.
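For reference, the operation described above appears to match the usual definition of the between-cluster scatter matrix. The notation is an assumption introduced here, not taken from the post: $n_k$ is the number of rows in cluster $k$, $\mu_k$ is that cluster's mean vector, $m$ is the total number of rows, and $\mu$ is the overall mean.

```latex
\mu = \frac{1}{m} \sum_{k} n_k \, \mu_k ,
\qquad
S_B = \sum_{k} n_k \, (\mu_k - \mu)(\mu_k - \mu)^{\mathsf{T}}
```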

To do this, I tried the following:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'group': [1, 2, 1, 0, 0, 0, 1, 2],
                       'A': [1.5, 0.5, 2.5, 0.5, 1.5, 0.5, 1.5, 0.5],
                       'B': [3.5, 2.5, 3.5, 2.5, 3.5, 2.5, 3.5, 2.5]})

    features = list(df.columns)
    features.remove('group')

    def g(x, mu):
        y = np.array([np.mean(x) - mu])
        print((y.T @ y)*len(x))
        print("")
        return (y.T @ y)*len(x)

    m = len(df.index)
    mu = df.groupby('group')[features].apply(lambda x: (np.multiply(x.count(), np.mean(x)))/m).sum()
    print("mu:")
    print(mu)
    Sb = df.groupby('group')[features].apply(g, mu=(mu)).sum()

This example throws a TypeError: Series.name must be a hashable type error on the last line. The print statement in the g function shows the result as expected, see below, so I believe the error is due to the .sum() operation.

    [[0.2553 0.1458]
     [0.1458 0.0833]]

    [[1.5052 1.0625]
     [1.0625 0.75  ]]

    [[0.7812 0.625 ]
     [0.625  0.5   ]]

The result I was expecting by adding the .sum() operation was the element-wise addition of the three matrices above.

The expected output is:

    [[2.5416 1.8333]
     [1.8333 1.3333]]

Any ideas why this is giving me an error and what I can do to correct it?

Update 1:
Using:

    Sb = df.groupby('group').apply(g, mu=(mu)).sum()

instead of

    Sb = df.groupby('group')[features].apply(g, mu=(mu)).sum()

gives the correct matrix, padded with NaNs. Why does selecting [features] cause an error?

Answer 1

Score: 3

Have you tried this?

    sb = df.groupby('group').apply(g, mu=(mu)).sum()

It gives the following result:

    [[2.54166667 1.83333333        nan]
     [1.83333333 1.33333333        nan]
     [       nan        nan        nan]]

Is that what you want?

You still have to deal with the NaNs, though.
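One way to deal with the padding, sketched here under the assumption that the unwanted entries form rows and columns that are entirely NaN (as in the output above), is to mask them out with NumPy. The names `padded` and `trimmed` are illustrative and not from the original post.

```python
import numpy as np

# The padded result copied from the output above (illustrative input).
padded = np.array([
    [2.54166667, 1.83333333, np.nan],
    [1.83333333, 1.33333333, np.nan],
    [np.nan, np.nan, np.nan],
])

# Keep only the rows and columns that contain at least one real number.
keep_rows = ~np.isnan(padded).all(axis=1)
keep_cols = ~np.isnan(padded).all(axis=0)
trimmed = padded[np.ix_(keep_rows, keep_cols)]

print(trimmed)
```

This only removes rows and columns that are NaN throughout; a NaN that appears inside an otherwise valid row would be left in place.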

Edit to answer your comments:

To address the problem you raised in the comments, you could change your function as follows:

    def g(x, mu):
        x = x[["A", "B"]]  # or x = x[features]
        y = np.array([np.mean(x) - mu])
        print((y.T @ y)*(len(x)))
        print("")
        return (y.T @ y)*(len(x))

and then:

    sb = df.groupby(['group']).apply(g, mu=(mu)).sum()
    print(sb)

which gives:

    [[2.54166667 1.83333333]
     [1.83333333 1.33333333]]
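As an alternative sketch (not the approach given in the answer above), the per-cluster matrices can be accumulated in a plain loop over the groups, which avoids relying on how `groupby(...).apply(...).sum()` combines array-valued results. The helper name `between_scatter` is hypothetical, and the overall mean is computed directly with `df[features].mean()`, which should equal the count-weighted `mu` from the question as long as there are no missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": [1, 2, 1, 0, 0, 0, 1, 2],
    "A": [1.5, 0.5, 2.5, 0.5, 1.5, 0.5, 1.5, 0.5],
    "B": [3.5, 2.5, 3.5, 2.5, 3.5, 2.5, 3.5, 2.5],
})
features = [c for c in df.columns if c != "group"]

# Overall mean of each feature (assumes no missing values).
mu = df[features].mean().to_numpy()


def between_scatter(frame, group_col, feature_cols, overall_mean):
    """Sum n_k * (mu_k - mu)(mu_k - mu)^T over all groups (hypothetical helper)."""
    total = np.zeros((len(feature_cols), len(feature_cols)))
    for _, block in frame.groupby(group_col):
        # Difference between this cluster's mean and the overall mean.
        diff = block[feature_cols].mean().to_numpy() - overall_mean
        # Outer product weighted by the cluster size.
        total += len(block) * np.outer(diff, diff)
    return total


Sb = between_scatter(df, "group", features, mu)
print(Sb)
```

For the sample data this should reproduce the 2×2 matrix shown above, without any NaN padding, because only the feature columns ever enter the computation.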

huangapple
  • Posted on January 6, 2020, 20:16:22
  • Please keep this link when reposting: https://go.coder-hub.com/59611960.html