Usage of explained_variance_ for sklearn.decomposition.PCA
Question
I am working on some principal component analysis and I happened to run into a couple of different ways of getting eigenvectors and eigenvalues.
Specifically, I found that in scipy.sparse.linalg there's an eigs method, and in sklearn.decomposition.PCA() I can also get the eigenvalues by accessing the explained_variance_ attribute.
However, I've run it a couple of times and I am getting some mismatches in eigenvalues. I understand that it's possible for eigenvectors to be different because they may be scalar multiples, but I don't understand how eigenvalues could also differ.
Here's an example:
import numpy as np
import scipy.sparse.linalg as ll
from sklearn.decomposition import PCA
a = np.array([[0,0,0],[0,0,1],[0,1,0]])
w1, v1 = ll.eigs(a, k=3)
w2 = PCA(n_components=3).fit(a).explained_variance_
w1.real
# array([ 1., -1., 0.])
w2
# array([0.5 , 0.16666667, 0. ])
You'll see that w1 and w2 have different eigenvalues. I'm not sure if I'm misunderstanding some basic linear algebra concepts here, or if something's wrong with my code.
Answer 1
Score: 1
scikit-learn's PCA fit() method takes as input a dataset X of shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features, and then decomposes the (n_features, n_features) covariance matrix of X. By contrast, scipy's eigs() takes the matrix to be decomposed directly as input.
This means that in order to obtain similar eigenvalues, you should fit scikit-learn's PCA to a dataset X whose covariance matrix is close to a; see the example below:
import numpy as np
import scipy.sparse.linalg as ll
from sklearn.decomposition import PCA
from sklearn.datasets import make_spd_matrix
# number of dimensions
n_dim = 3
# covariance matrix
a = make_spd_matrix(n_dim=n_dim, random_state=42)
# dataset with given covariance matrix
np.random.seed(42)
X = np.random.multivariate_normal(mean=np.zeros(n_dim), cov=a, size=100000)
# decompositions
w0 = np.linalg.eig(a)[0]
w1 = ll.eigs(a, k=n_dim, return_eigenvectors=False)
w2 = PCA(n_components=n_dim).fit(X).explained_variance_
# eigenvalues
print([format(w, '.3f') for w in np.sort(w0.real)[::-1]])
print([format(w, '.3f') for w in np.sort(w1.real)[::-1]])
print([format(w, '.3f') for w in np.sort(w2.real)[::-1]])
# ['3.616', '0.841', '0.242']
# ['3.616', '0.841', '0.242']
# ['3.616', '0.841', '0.242']
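To tie this back to the matrix in the question (a supplementary sketch of my own, not part of the original answer): the w2 values reported there are the eigenvalues of the sample covariance of a's columns, computed with the same n_samples - 1 denominator that PCA uses for explained_variance_. Something along these lines should reproduce them; the variable names are mine:
import numpy as np
from sklearn.decomposition import PCA
# the 3x3 array from the question, treated by PCA as 3 samples x 3 features
a = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0]])
# sample covariance of the columns of a (np.cov also divides by n_samples - 1)
cov = np.cov(a, rowvar=False)
# eigenvalues of that covariance matrix, sorted in descending order
w_cov = np.sort(np.linalg.eigvalsh(cov))[::-1]
# eigenvalues reported by PCA via explained_variance_
w_pca = PCA(n_components=3).fit(a).explained_variance_
print(w_cov)  # approximately [0.5, 0.16666667, 0.], matching w2 from the question
print(w_pca)  # approximately [0.5, 0.16666667, 0.]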