Why does pandas not use NumPy's correlation method?

Question

Recently I realised that NumPy's correlation function is much faster than the pandas equivalent.

If I compute pairwise correlations across ~18k features, NumPy is about 100x faster.

%timeit np.corrcoef(df.values)
5.17 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%timeit df.T.corr()
8min 49s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Why don't they just use NumPy for that? I have checked the source code of both. NumPy uses vectorization, whereas pandas uses loops, which makes it slower.

Answer 1

Score: 2

The difference has nothing to do with NumPy being inherently faster than pandas, or with compiled code. Pandas computes the correlation coefficient with an implementation written in Cython, so compilation alone would NOT explain a performance difference of this magnitude.

The main difference is that pandas handles NaN values dynamically. If a column contains a NaN value, that value is ignored, along with the corresponding value in the other column of each pair, when computing the correlation coefficient.

https://github.com/pandas-dev/pandas/blob/3e913c27faeb76cf3c6f01e688a2eeee3762980f/pandas/_libs/algos.pyx#L343-L394

In the link above you can see that they iterate (in Cython) through ALL values in each pairwise comparison. Within each comparison, the code checks whether either value is NaN before adding it to the descriptive statistics.

To make sure NaNs can be ignored, they compute the descriptive statistics online (mean, standard deviation, SSE; all components of covariance and correlation), which avoids copying large amounts of data around the NaN values.

This online computation is a fairly slow approach, but a flexible one when dealing with NaNs.
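The loop described above can be sketched in plain Python. This is only an illustration of the idea (a Welford-style running update over the rows where both columns are non-NaN), not the pandas source; the function name `online_pairwise_corr` and its variable names are made up for this example.

```python
import numpy as np

def online_pairwise_corr(mat):
    """Pairwise-complete Pearson correlation between the columns of a 2-D array.

    Hypothetical helper for illustration; it is not part of pandas or NumPy.
    """
    mat = np.asarray(mat, dtype=float)
    n_rows, n_cols = mat.shape
    valid = ~np.isnan(mat)
    result = np.full((n_cols, n_cols), np.nan)

    for xi in range(n_cols):
        for yi in range(xi + 1):
            nobs = 0
            mean_x = mean_y = 0.0
            ssq_x = ssq_y = cov_xy = 0.0
            # Visit every row, skipping rows where either column is NaN.
            for i in range(n_rows):
                if valid[i, xi] and valid[i, yi]:
                    vx, vy = mat[i, xi], mat[i, yi]
                    nobs += 1
                    dx = vx - mean_x
                    dy = vy - mean_y
                    # Running (online) updates of means and sums of squares.
                    mean_x += dx / nobs
                    mean_y += dy / nobs
                    ssq_x += (vx - mean_x) * dx
                    ssq_y += (vy - mean_y) * dy
                    cov_xy += (vx - mean_x) * dy
            denom = np.sqrt(ssq_x * ssq_y)
            if nobs > 1 and denom > 0:
                result[xi, yi] = result[yi, xi] = cov_xy / denom
    return result
```

Every pair of columns gets its own pass over all rows, so the work grows with the number of column pairs times the number of rows, and none of it can be handed to a single vectorized matrix product. That is the price of dropping NaNs pair by pair.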

NumPy, by contrast, lets NaNs propagate (which is standard behavior). If a single value in a column is NaN, all correlations with that column also become NaN. This means the NumPy approach does not need to perform any (costly) online statistics when calculating the correlation coefficients.

import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(0, 3, size=(10, 5)))
df.loc[3, 2] = np.nan

print(df.corr()) # nans ignored
          0         1         2         3         4
0  1.000000 -0.090192  0.517099  0.358564 -0.220882
1 -0.090192  1.000000 -0.560877 -0.539464  0.260559
2  0.517099 -0.560877  1.000000  0.504788 -0.465717
3  0.358564 -0.539464  0.504788  1.000000 -0.524223
4 -0.220882  0.260559 -0.465717 -0.524223  1.000000

print(np.corrcoef(df, rowvar=False)) # NaN values propagate
array([[ 1.        , -0.09019204,         nan,  0.35856406, -0.22088229],
       [-0.09019204,  1.        ,         nan, -0.53946446,  0.26055938],
       [        nan,         nan,         nan,         nan,         nan],
       [ 0.35856406, -0.53946446,         nan,  1.        , -0.52422259],
       [-0.22088229,  0.26055938,         nan, -0.52422259,  1.        ]])
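
If you know (or can check) that the data contains no NaNs, one way to get the NumPy speed while keeping pandas labels is to dispatch on that check. This is a minimal sketch, assuming an all-numeric DataFrame; `fast_corr` is a hypothetical helper name, not a pandas or NumPy function.

```python
import numpy as np
import pandas as pd

def fast_corr(df):
    """Column-wise Pearson correlation; uses NumPy when no NaNs are present.

    Hypothetical helper for illustration; assumes all columns are numeric.
    """
    values = df.to_numpy(dtype=float)
    if np.isnan(values).any():
        # Fall back to pandas, which drops NaNs pair by pair.
        return df.corr()
    corr = np.corrcoef(values, rowvar=False)
    return pd.DataFrame(corr, index=df.columns, columns=df.columns)
```

With NaN-free input the two paths should agree up to floating-point error, so the choice only affects speed.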

Posted by huangapple on May 22, 2023, 23:42:01
When reposting, please keep the link to this article: https://go.coder-hub.com/76307855.html