May 22, 2023 23:42:01

Why does pandas not use NumPy correlation method?

Question

I recently realised that NumPy's correlation function is much faster than the pandas one.

If I compute pairwise correlations across ~18k features, NumPy is about 100x faster.

%timeit np.corrcoef(df.values)
5.17 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%timeit df.T.corr()
8min 49s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Why doesn't pandas just use NumPy for this? I have checked the source code of both. NumPy uses vectorization, whereas pandas uses loops, which makes it slower.

Answer 1

Score: 2

The difference has little to do with NumPy being inherently faster than pandas, or with compiled versus interpreted code. pandas computes the correlation coefficient with a routine written in Cython, so the language alone would NOT explain a performance gap of this magnitude.

The main difference is that pandas handles NaN values dynamically. If a single column contains a NaN value, that value is ignored when computing the correlation coefficient, as is the corresponding value in the other column of each pairwise comparison.
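To make the pairwise handling concrete, here is a small sketch. The column names `a`, `b`, `c` and the toy values are made up for illustration. It checks that the pandas result for a pair of columns matches `np.corrcoef` applied only to the rows where both columns are present.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); column "b" has one missing entry.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, np.nan, 1.0, 5.0, 4.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

# pandas: NaNs are dropped separately for each pair of columns.
pandas_ab = df.corr().loc["a", "b"]

# The same thing by hand: keep only rows where both "a" and "b" are present.
mask = df["a"].notna() & df["b"].notna()
manual_ab = np.corrcoef(df.loc[mask, "a"], df.loc[mask, "b"])[0, 1]

# The pair ("a", "c") has no NaNs, so all five rows are used for it.
pandas_ac = df.corr().loc["a", "c"]
manual_ac = np.corrcoef(df["a"], df["c"])[0, 1]

print(np.isclose(pandas_ab, manual_ab), np.isclose(pandas_ac, manual_ac))
```

Note that the row with the missing `b` still contributes to the `a`/`c` correlation, which is why the NaN check has to happen per pair rather than once for the whole frame.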

https://github.com/pandas-dev/pandas/blob/3e913c27faeb76cf3c6f01e688a2eeee3762980f/pandas/_libs/algos.pyx#L343-L394

In the link above you can see that they iterate (in Cython) through ALL values in each of the pairwise comparisons. Within each comparison, the code checks whether either value is NaN before adding the pair to the descriptives.

To ensure NaNs can be ignored, they use an online calculation of the descriptives (mean, stdev, sse; all components of covariance & correlation), which avoids copying large amounts of data around the NaN values.

This online computation of the descriptives is a fairly slow approach, but it is a flexible one when dealing with NaNs.
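As a rough illustration of what an online calculation with a per-element NaN check looks like, here is a pure-Python sketch. It assumes a Welford-style running update; the function name `online_corr_skip_nan` is hypothetical, and the real Cython routine in the link differs in its details.

```python
import math

def online_corr_skip_nan(x, y):
    """Pearson correlation of two equal-length sequences, skipping any
    position where either value is NaN (hypothetical helper, not pandas code)."""
    n = 0
    mean_x = mean_y = 0.0
    ss_x = ss_y = cov_xy = 0.0  # running sums of squares and cross-products
    for vx, vy in zip(x, y):
        if math.isnan(vx) or math.isnan(vy):
            continue  # the per-element NaN check
        n += 1
        dx = vx - mean_x  # deviation from the previous running mean of x
        dy = vy - mean_y  # deviation from the previous running mean of y
        mean_x += dx / n
        mean_y += dy / n
        # Welford-style updates: old deviation times new deviation.
        ss_x += dx * (vx - mean_x)
        ss_y += dy * (vy - mean_y)
        cov_xy += dx * (vy - mean_y)
    if n < 2 or ss_x == 0.0 or ss_y == 0.0:
        return math.nan  # too few valid pairs, or a constant column
    return cov_xy / math.sqrt(ss_x * ss_y)

r = online_corr_skip_nan([1.0, 2.0, 3.0, 4.0, 5.0],
                         [2.0, math.nan, 1.0, 5.0, 4.0])
print(r)
```

Every pair of columns needs its own pass like this, with a branch on every element, which is the kind of work a single vectorized matrix computation avoids.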

NumPy, by contrast, lets NaNs propagate (which is standard behavior). So if a single value in a column is NaN, all correlations involving that column also become NaN. This means the NumPy approach does not need to perform any (costly) online stats when calculating the correlation coefficients.

import pandas as pd
import numpy as np
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(0, 3, size=(10, 5)))
df.loc[3, 2] = np.nan  # put a single NaN in column 2
print(df.corr())  # NaNs are ignored pairwise
          0         1         2         3         4
0  1.000000 -0.090192  0.517099  0.358564 -0.220882
1 -0.090192  1.000000 -0.560877 -0.539464  0.260559
2  0.517099 -0.560877  1.000000  0.504788 -0.465717
3  0.358564 -0.539464  0.504788  1.000000 -0.524223
4 -0.220882  0.260559 -0.465717 -0.524223  1.000000
print(repr(np.corrcoef(df, rowvar=False)))  # NaN values propagate
array([[ 1.        , -0.09019204,         nan,  0.35856406, -0.22088229],
       [-0.09019204,  1.        ,         nan, -0.53946446,  0.26055938],
       [        nan,         nan,         nan,         nan,         nan],
       [ 0.35856406, -0.53946446,         nan,  1.        , -0.52422259],
       [-0.22088229,  0.26055938,         nan, -0.52422259,  1.        ]])
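If the data is known to contain no NaNs, the two approaches should give the same numbers, so one possible workaround is to call NumPy directly and wrap the result back in a DataFrame. This is only a sketch: the helper name `corr_fast` is hypothetical, and falling back to `DataFrame.corr` when NaNs are present is an assumption about the behavior you would want.

```python
import numpy as np
import pandas as pd

def corr_fast(df: pd.DataFrame) -> pd.DataFrame:
    """Column-wise Pearson correlation (hypothetical helper).

    Uses np.corrcoef when there are no NaNs, otherwise falls back to
    DataFrame.corr so that NaNs are still ignored pairwise.
    """
    if df.isna().to_numpy().any():
        return df.corr()
    values = df.to_numpy(dtype=float)
    corr = np.corrcoef(values, rowvar=False)  # columns are the variables
    return pd.DataFrame(corr, index=df.columns, columns=df.columns)

rng = np.random.default_rng(0)
clean = pd.DataFrame(rng.normal(0, 3, size=(10, 5)))
same = np.allclose(corr_fast(clean).to_numpy(), clean.corr().to_numpy())
print(same)
```

In the question's setup the features are rows, so the equivalent of `df.T.corr()` would be `corr_fast(df.T)`.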
