SciPy.Stats.Zscore appears to be calculating the z-score subtly incorrectly

Question

I am trying to calculate z-scores for a dataset using scipy.stats, and am running into a very weird subtle error that I cannot figure out. The code is running, but appears to be producing data that is slightly off, which I am concerned is adversely impacting a PCA that I am running on the normalized dataset.

I have the following data in a list:

mylist = [0.565, 0.629, 0.687, 0.797, 0.56, 0.722]

I run the following commands to Z-score normalize the data using scipy.stats:

import scipy.stats as scipy
zscore_list = scipy.zscore(mylist)

Result:

[-1.11793077, -0.36479846, 0.31772769, 1.61217384, -1.17676923, 0.72959692]

However, when I calculate the same data manually, I get a different result:

import statistics as stats
for x in mylist:
    print((x - stats.mean(mylist)) / stats.stdev(mylist))

Result:

-1.0205264990693814
-0.33301391022264026
0.29004437341971895
1.471706635500054
-1.074238420073032
0.6660278204452793

I have tried various things to address the issue, including converting "mylist" into a numpy array, using axis=None and ddof=0 in the call to "scipy.zscore", and nothing changes the result.

Answer 1

Score: 4

UPDATE: So I figured it out, and figured I would leave this up for anyone interested.

There are two ways to calculate standard deviation. The population standard deviation divides the sum of squared deviations from the mean by the population size N before taking the square root; the sample standard deviation divides that same sum by n-1, where n is the sample size.
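To make the two definitions concrete, here is a minimal sketch using only the Python standard library and the list from the question. The variable names (data, ss, population_sd, sample_sd) are my own, chosen for illustration.

```python
import math
import statistics

data = [0.565, 0.629, 0.687, 0.797, 0.56, 0.722]  # the list from the question
n = len(data)
mean = statistics.mean(data)

# Sum of squared deviations from the mean, shared by both formulas.
ss = sum((x - mean) ** 2 for x in data)

population_sd = math.sqrt(ss / n)    # divide by N
sample_sd = math.sqrt(ss / (n - 1))  # divide by n - 1

# The standard library exposes both definitions directly.
assert math.isclose(population_sd, statistics.pstdev(data))
assert math.isclose(sample_sd, statistics.stdev(data))

# The two differ by a constant factor of sqrt(n / (n - 1)).
ratio = sample_sd / population_sd
print(population_sd, sample_sd, ratio)
```

Because the two standard deviations differ only by that constant factor, every z-score in the list is scaled by the same amount, which is why the two sets of results in the question look proportionally "slightly off" rather than randomly different.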

As it turns out, NumPy (and hence SciPy) uses the population standard deviation by default, i.e. ddof=0, which is why passing ddof=0 explicitly changed nothing. Python's statistics.stdev, which I used for the manual calculation, uses the sample standard deviation. The population form is rarely appropriate; for most applications where one is inferring population-level statistics from a sample, the sample standard deviation is the better choice. To correct this, I set ddof=1 in my call to scipy.stats.zscore, and its output then matched my manual calculation.
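As a quick check of the fix, the sketch below compares scipy.stats.zscore with and without ddof=1 against the manual calculation from the question. It assumes NumPy and SciPy are installed; the names z_default, z_sample and z_manual are illustrative.

```python
import statistics

import numpy as np
from scipy import stats

data = [0.565, 0.629, 0.687, 0.797, 0.56, 0.722]  # the list from the question
n = len(data)

# Default call: ddof=0, i.e. the population standard deviation.
z_default = stats.zscore(data)

# ddof=1 switches the denominator to n - 1 (sample standard deviation).
z_sample = stats.zscore(data, ddof=1)

# Manual calculation, as in the question (statistics.stdev divides by n - 1).
mean = statistics.mean(data)
sd = statistics.stdev(data)
z_manual = np.array([(x - mean) / sd for x in data])

# With ddof=1 the SciPy result agrees with the manual one.
assert np.allclose(z_sample, z_manual)

# The default result is the same vector scaled by sqrt(n / (n - 1)).
assert np.allclose(z_default, z_sample * np.sqrt(n / (n - 1)))

print(z_default)
print(z_sample)
```

Neither result is wrong in itself: the two calls apply different definitions, and they only need to be consistent with whatever the rest of the analysis assumes.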

I hope this is helpful to someone!

huangapple
  • Posted on May 15, 2023 at 05:02:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76249664.html