Posted 2023-05-15 05:02:30

SciPy.Stats.Zscore appears to be calculating the z-score subtly incorrectly

Question


I am trying to calculate z-scores for a dataset using scipy.stats, and am running into a very weird subtle error that I cannot figure out. The code is running, but appears to be producing data that is slightly off, which I am concerned is adversely impacting a PCA that I am running on the normalized dataset.

I have the following data in a list:

mylist = [0.565, 0.629, 0.687, 0.797, 0.56, 0.722]

I run the following commands to Z-score normalize the data using scipy.stats:

import scipy.stats as scipy
zscore_list = scipy.zscore(mylist)
[-1.11793077, -0.36479846,  0.31772769,  1.61217384, -1.17676923, 0.72959692]

However, when I calculate the same data manually, I get a different result:

import statistics as stats
for x in mylist:
    print(str((x - stats.mean(mylist)) / stats.stdev(mylist)))

Result:

-1.0205264990693814
-0.33301391022264026
0.29004437341971895
1.471706635500054
-1.074238420073032
0.6660278204452793

I have tried various things to address the issue, including converting "mylist" into a numpy array, using axis=None and ddof=0 in the call to "scipy.zscore", and nothing changes the result.

Answer 1

Score: 4

UPDATE: So I figured it out, and figured I would leave this up for anyone interested.

There are two ways to calculate standard deviation. The population standard deviation divides the sum of the squared deviations from the mean by the population size N and then takes the square root. The sample standard deviation divides the same sum by n-1, one less than the sample size, before taking the square root.
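
To make the difference concrete, here is a minimal sketch using only the standard library. The helper name zscore_manual and its ddof argument are my own for illustration (they are not part of SciPy); the divisor is N - ddof, so ddof=0 gives the population version and ddof=1 the sample version.

```python
import math


def zscore_manual(values, ddof=0):
    """Hypothetical helper: z-scores using the divisor N - ddof."""
    n = len(values)
    mean = sum(values) / n
    # Sum of squared deviations from the mean (shared by both definitions).
    squared_deviations = sum((x - mean) ** 2 for x in values)
    std = math.sqrt(squared_deviations / (n - ddof))
    return [(x - mean) / std for x in values]


mylist = [0.565, 0.629, 0.687, 0.797, 0.56, 0.722]

z_population = zscore_manual(mylist, ddof=0)  # divide by N
z_sample = zscore_manual(mylist, ddof=1)      # divide by n - 1

# Print the two versions side by side.
for zp, zs in zip(z_population, z_sample):
    print(f"{zp: .8f}  {zs: .8f}")
```

The first column should line up with the SciPy output in the question and the second with the manual statistics.stdev output. The two differ only by the constant factor sqrt(n / (n - 1)), which is sqrt(6/5), roughly 1.095, for six values.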

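As it turns out, NumPy (and hence SciPy) uses the population standard deviation by default (ddof=0), whereas statistics.stdev computes the sample standard deviation. That is why the two sets of results differ, and why passing ddof=0 explicitly changed nothing: it is already the default. The population version is rarely appropriate; for most applications where one is inferring population-level stats from a sample, the sample standard deviation is the better choice. To correct this, I set ddof=1 in my call to scipy.stats.zscore, and the results then matched my manual calculation.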
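
For anyone who wants to check this themselves, the following is a rough sketch of the comparison, assuming NumPy and SciPy are installed. The variable names are mine; statistics.pstdev and statistics.stdev are the standard library's population and sample standard deviations.

```python
import statistics

import numpy as np
from scipy import stats

mylist = [0.565, 0.629, 0.687, 0.797, 0.56, 0.722]

# SciPy z-scores: the default ddof=0 (population) versus ddof=1 (sample).
z_population = stats.zscore(mylist)
z_sample = stats.zscore(mylist, ddof=1)

# The same z-scores computed by hand with the statistics module.
mean = statistics.mean(mylist)
z_manual_population = [(x - mean) / statistics.pstdev(mylist) for x in mylist]
z_manual_sample = [(x - mean) / statistics.stdev(mylist) for x in mylist]

# Both comparisons should report a match.
print(np.allclose(z_population, z_manual_population))
print(np.allclose(z_sample, z_manual_sample))
```

In other words, neither calculation in the question is wrong; they just use different divisors.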

I hope this is helpful to someone!
