scipy.stats.zscore appears to be calculating the z-score subtly incorrectly
Question
I am trying to calculate z-scores for a dataset using scipy.stats, and am running into a very weird subtle error that I cannot figure out. The code is running, but appears to be producing data that is slightly off, which I am concerned is adversely impacting a PCA that I am running on the normalized dataset.
I have the following data in a list:
mylist = [0.565, 0.629, 0.687, 0.797, 0.56, 0.722]
I run the following commands to Z-score normalize the data using scipy.stats:
import scipy.stats as scipy
zscore_list = scipy.zscore(mylist)
[-1.11793077, -0.36479846, 0.31772769, 1.61217384, -1.17676923, 0.72959692]
However, when I calculate the same data manually, I get a different result:
import statistics as stats
for x in mylist:
    print((x - stats.mean(mylist)) / stats.stdev(mylist))
Result:
-1.0205264990693814
-0.33301391022264026
0.29004437341971895
1.471706635500054
-1.074238420073032
0.6660278204452793
I have tried various things to address the issue, including converting "mylist" into a NumPy array and passing axis=None and ddof=0 in the call to "scipy.zscore", but nothing changes the result.
Answer 1
Score: 4
UPDATE: I figured it out, and I am leaving this up for anyone interested.
There are two ways to calculate standard deviation: the population standard deviation, which divides the sum of squared differences from the mean by the population size N before taking the square root, and the sample standard deviation, which divides that same sum by n - 1 (one less than the sample size n) instead.
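As a rough illustration of the two definitions, here is a minimal sketch that uses only the Python standard library and the data from the question. The variable names (data, pop_sd, sample_sd) are my own and not part of any library.

```python
import math
import statistics

data = [0.565, 0.629, 0.687, 0.797, 0.56, 0.722]
n = len(data)
mean = sum(data) / n

# Sum of squared differences from the mean, shared by both definitions.
sum_sq = sum((x - mean) ** 2 for x in data)

pop_sd = math.sqrt(sum_sq / n)           # population: divide by N
sample_sd = math.sqrt(sum_sq / (n - 1))  # sample: divide by n - 1

# The standard library exposes both variants under separate names.
print(pop_sd, statistics.pstdev(data))
print(sample_sd, statistics.stdev(data))
```

The sample value is always larger by a factor of sqrt(n / (n - 1)), which for six data points is about 1.095. That is the same ratio seen between the two sets of z-scores in the question.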
As it turns out, NumPy (and hence SciPy) uses the population standard deviation by default (ddof=0), whereas Python's statistics.stdev uses the sample standard deviation, which is why the two sets of results differ. For most applications where one is inferring population-level statistics from a sample, the sample standard deviation is more appropriate. To get it, I set ddof=1 in my call to scipy.stats.zscore, and the output then matched my manual calculation.
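To check this, the following sketch compares both ddof settings against the manual calculation from the question. It assumes NumPy and SciPy are installed. The names z_pop, z_sample and z_manual are illustrative only.

```python
import statistics

import numpy as np
from scipy import stats

data = [0.565, 0.629, 0.687, 0.797, 0.56, 0.722]

z_pop = stats.zscore(data)             # default ddof=0: population standard deviation
z_sample = stats.zscore(data, ddof=1)  # ddof=1: sample standard deviation

# Manual z-scores using the sample standard deviation, as in the question.
mean = statistics.mean(data)
sd = statistics.stdev(data)
z_manual = np.array([(x - mean) / sd for x in data])

print(z_pop)
print(z_sample)
print(np.allclose(z_sample, z_manual))  # the ddof=1 result agrees with the manual one
```

Because the two variants differ only by the constant factor sqrt(n / (n - 1)), and that factor is the same for every column with the same number of observations, the choice should not change the principal directions found by a PCA, only the overall scale of the variances.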
I hope this is helpful to someone!