Why do we not order the scaled features inline with eigen vectors in PCA?

Question


The code below is a typical example of the PCA process that I see on the web.

In most cases the eigenvectors and eigenvalues are re-ordered, but the feature data (X_meaned) is not. Can someone explain why that does NOT apply the wrong vectors to the data in the dot product (Step-6)?

I have added a 'Step-NEW'; why is it unnecessary?

https://www.askpython.com/python/examples/principal-component-analysis

import numpy as np

def PCA(X, num_components):

    # Step-1: center the data (subtract the per-feature mean)
    X_meaned = X - np.mean(X, axis=0)

    # Step-2: covariance matrix of the centered data (features as columns)
    cov_mat = np.cov(X_meaned, rowvar=False)

    # Step-3: eigen-decomposition of the (symmetric) covariance matrix
    eigen_values, eigen_vectors = np.linalg.eigh(cov_mat)

    # Step-4: sort eigenvalues (and the matching eigenvector columns) in descending order
    sorted_index = np.argsort(eigen_values)[::-1]
    sorted_eigenvalue = eigen_values[sorted_index]
    sorted_eigenvectors = eigen_vectors[:, sorted_index]

    # Step-NEW: my addition -- reorder the rows of X_meaned the same way
    X_meaned = X_meaned[sorted_index]

    # Step-5: keep the first num_components eigenvectors
    eigenvector_subset = sorted_eigenvectors[:, 0:num_components]

    # Step-6: project the centered data onto the selected eigenvectors
    X_reduced = np.dot(eigenvector_subset.transpose(), X_meaned.transpose()).transpose()

    return X_reduced

Answer 1

Score: 1


Well, this is more a math question than a Python question.

But I feel that you are confusing sorting the components of the data with sorting the vectors themselves. When you sort eigenvectors, you do not sort their components.

Note that I am not 100% sure that this is your misunderstanding, only that there is one, because the line

X_meaned = X_meaned[sorted_index]

doesn't make any sense. Nothing even guarantees that it won't raise an "index out of bounds" error. sorted_index contains indices between 0 and N-1, where N is the number of eigenvectors, i.e. the number of dimensions of your feature space (most of the time, unless you deliberately restrict the PCA or some axes are 100% correlated, which never happens in real life, N is also the dimension of your X vectors, that is, the dimension of X[i] or X_meaned[i]). There is no reason why N should also be the number of rows of X (the number of data points, as opposed to the dimension of each data point), and that row count is what X_meaned[sorted_index] indexes into.
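To make the shapes concrete, here is a quick sketch with made-up 100×6 data (6 features, 100 samples; nothing from your question, just an illustration of the mismatch):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))            # 100 samples, 6 features (made-up data)
X_meaned = X - np.mean(X, axis=0)

cov_mat = np.cov(X_meaned, rowvar=False)
eigen_values, eigen_vectors = np.linalg.eigh(cov_mat)
sorted_index = np.argsort(eigen_values)[::-1]

print(sorted_index.shape)                # (6,)   -> one index per feature
print(X_meaned[sorted_index].shape)      # (6, 6) -> only 6 of the 100 rows survive

With more rows than columns it does not crash, it just silently selects the wrong thing; with fewer rows than columns it crashes (see the end of this answer).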

Let's use a simpler example, with basic 2D geometry.

Let's say that you are trying to do a PCA on a set of points:

X = np.array([
    [ 1.4,  15],
    [ 2.1,  20],
    [-3,   -31],
    [-0.5,  -4]
])

Step 1

is to center the data, i.e. subtract the mean. It happens that the mean is already [0, 0] (because I am lazy, so I have chosen an example where it is, and because this is obviously not the step you are misunderstanding). So X_meaned is the same as X.
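You can verify that directly:

import numpy as np

X = np.array([[1.4, 15], [2.1, 20], [-3, -31], [-0.5, -4]])
print(np.mean(X, axis=0))        # [0. 0.] -> already centered
X_meaned = X - np.mean(X, axis=0)
print(np.allclose(X_meaned, X))  # True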

Step 2

is to compute the covariance matrix.
The diagonal elements are just the variances of the individual components.
So

Var(X₁) = (1.4² + 2.1² + (-3)² + (-0.5)²)/3 = 5.2066667
Var(X₂) = (15² + 20² + (-31)² + (-4)²)/3 = 534

And the off-diagonal elements (here there is just one, since the covariance matrix is symmetric and, in any case, with 2 dimensions there is only one possible covariance):

Cov(X₁,X₂) = (1.4*15 + 2.1*20 + (-3)*(-31) + (-0.5)*(-4))/3 = 52.6666667

Hence the covariance matrix

[[5.20666667,    52.6666667],
 [52.6666667,    534       ]]

Your code obviously gives the same result

np.cov(X, rowvar=False)
# array([[  5.20666667,  52.66666667],
#        [ 52.66666667, 534.        ]])

This alone gives a first hint about the PCA result: the covariance between the two variables is almost identical to the geometric mean of the variances (52.67 ≈ √(5.206667 × 534) ≈ 52.73), which means that they are highly correlated.
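If you want to check that hint numerically (not part of the original code, just a sanity check):

import numpy as np

X = np.array([[1.4, 15], [2.1, 20], [-3, -31], [-0.5, -4]])
cov_mat = np.cov(X, rowvar=False)

# off-diagonal covariance vs. geometric mean of the two variances
print(cov_mat[0, 1], np.sqrt(cov_mat[0, 0] * cov_mat[1, 1]))  # ~52.67 vs ~52.73
print(np.corrcoef(X, rowvar=False)[0, 1])                     # ~0.9988 -> highly correlated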

Step 3

is to diagonalize this matrix, that is, to find another basis (instead of the traditional i, j basis, i.e. the (1,0), (0,1) basis) in which the covariance matrix is diagonal.

We could do this by hand, as you have probably learned at school (compute the characteristic polynomial to find the 2 eigenvalues, then solve Cov·u = λu to find the eigenspaces).

But, long story short, we find 2 eigenvalues,
λ₁ = 1.22×10⁻² and λ₂ = 539.2,
and two associated unit vectors,
u₁ = [0.995, -0.098] and u₂ = [0.098, 0.995]
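np.linalg.eigh gives the same numbers; note that it returns the eigenvalues in ascending order, which is exactly why Step-4 of your code re-sorts them:

import numpy as np

X = np.array([[1.4, 15], [2.1, 20], [-3, -31], [-0.5, -4]])
cov_mat = np.cov(X, rowvar=False)
eigen_values, eigen_vectors = np.linalg.eigh(cov_mat)

print(eigen_values)   # ~[0.0122, 539.19]  (ascending order)
print(eigen_vectors)  # columns are u₁ and u₂, up to a possible sign flip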

So, here it is important to understand what this means. It means that if you use this basis instead of the canonical one to represent your 4 data points in X, then the coordinates of those 4 points, which were obviously correlated in the canonical basis (y is roughly 10 times x), are no longer correlated in the new basis.

Your 4 points, in the new coordinate system, are now

array([[ -0.07905101,  15.06498427],
       [  0.12680531,  20.10954799],
       [  0.05722047, -31.14477044],
       [ -0.10497477,  -4.02976182]])

For example, the third one, whose coordinates were (-3, -31) in the canonical system (meaning that it is -3·(1,0) + -31·(0,1)), has coordinates (0.057, -31.145) in the new system. That is, the same point is also 0.057·u₁ - 31.145·u₂ (and you can easily check that this is indeed still (-3, -31)).
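If you want to reproduce these numbers (a small sketch; the signs may be flipped depending on the eigenvector signs returned by eigh):

import numpy as np

X = np.array([[1.4, 15], [2.1, 20], [-3, -31], [-0.5, -4]])
_, eigen_vectors = np.linalg.eigh(np.cov(X, rowvar=False))

new_coords = X @ eigen_vectors          # coordinates in the (u₁, u₂) basis
print(new_coords[2])                    # ~[0.057, -31.145] (up to sign)

# and back again: the basis is orthonormal, so V @ V.T = I
print(new_coords[2] @ eigen_vectors.T)  # ~[-3., -31.]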

This coordinate system is better because, firstly, in it the coordinates are independent: you can easily check that these new coordinates have zero covariance.
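A quick numerical check of that claim:

import numpy as np

X = np.array([[1.4, 15], [2.1, 20], [-3, -31], [-0.5, -4]])
_, eigen_vectors = np.linalg.eigh(np.cov(X, rowvar=False))
new_coords = X @ eigen_vectors

print(np.cov(new_coords, rowvar=False))
# off-diagonal terms are ~0; the diagonal holds the eigenvalues (~0.0122 and ~539.2)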

And also because it separates the main axis, the one along which most of the variation occurs, from the less important one, the one along which only small variations around the main tendency occur.

In our example, the main axis is the one along which y ≈ 10x (as we have already noticed, our data mainly follow this rule).
And the least important axis captures the small deviations from that rule, that is, it quantifies how different y and 10x nevertheless are.

To know which of our new axes u₁, u₂ is the main one (even if, in our case, it is obvious just by looking at them), you can rely on the eigenvalues.
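In code, that check is one line (using the eigenvalues computed above):

import numpy as np

X = np.array([[1.4, 15], [2.1, 20], [-3, -31], [-0.5, -4]])
eigen_values, _ = np.linalg.eigh(np.cov(X, rowvar=False))

print(eigen_values / eigen_values.sum())
# ~[2.3e-05, 0.99998] -> u₂ carries essentially all of the variance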

Step 4

This is why it is customary (not compulsory) to sort the vectors of the basis by their eigenvalues.

After all, what we are doing here is just choosing axes to represent our data. The order of those axes is arbitrary (exactly like naming one axis x and the other y in the canonical basis was arbitrary: you could have chosen to call x what I've called y and vice versa; it is just a naming choice). So, since we can choose whatever order we want, there is no reason to stick to the order u₁, u₂, and we could use the vectors u₂, u₁ for our new coordinate system if we wanted. The habit in PCA is to put the vectors with the highest eigenvalues first.

So, here, since the eigenvalue associated with u₂ is λ₂ = 539, which is way bigger than the eigenvalue associated with u₁, λ₁ = 0.0122, let's use (u₂, u₁) instead of (u₁, u₂) as our basis. The coordinates of our 4 points are now

array([[ 15.06498427,  -0.07905101],
       [ 20.10954799,   0.12680531],
       [-31.14477044,   0.05722047],
       [ -4.02976182,  -0.10497477]])

Exactly the same as before, but with swapped coordinates.
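The same swap falls out of the original code automatically, because Step-4 reorders the eigenvector columns before the data are projected; a quick sketch:

import numpy as np

X = np.array([[1.4, 15], [2.1, 20], [-3, -31], [-0.5, -4]])
eigen_values, eigen_vectors = np.linalg.eigh(np.cov(X, rowvar=False))

sorted_index = np.argsort(eigen_values)[::-1]          # [1, 0] for this example
sorted_eigenvectors = eigen_vectors[:, sorted_index]   # reorder the COLUMNS (basis vectors)

print(X @ sorted_eigenvectors)  # same 4 points as before, with the two coordinates swapped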

When I say that it is just a habit, just customary, understand that there are more than aesthetic reasons for it, though. It means that if you want to represent your data with fewer values (for example, for compression purposes), you can choose to keep only the first M columns. In my example, representing all our data along a single axis would not introduce much error, since almost all the information is on the first axis, now that I have sorted my axes so that the main one comes first.

With another example in dimension 100, you could choose to keep only the 10 main axes, which, if you have sorted your axes by eigenvalue, is easily done by dropping all but the first 10 columns.
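For the 2D example, that compression idea looks like this (a sketch, keeping num_components = 1):

import numpy as np

X = np.array([[1.4, 15], [2.1, 20], [-3, -31], [-0.5, -4]])
eigen_values, eigen_vectors = np.linalg.eigh(np.cov(X, rowvar=False))
sorted_index = np.argsort(eigen_values)[::-1]
sorted_eigenvectors = eigen_vectors[:, sorted_index]

eigenvector_subset = sorted_eigenvectors[:, :1]  # keep only the main axis
X_reduced = X @ eigenvector_subset               # shape (4, 1): one number per point
X_approx = X_reduced @ eigenvector_subset.T      # back to 2D

print(np.abs(X - X_approx).max())                # ~0.13: tiny compared to the data

"Keep the first M columns" only means "keep the M most informative axes" because the axes were sorted by eigenvalue first.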

Note that, in your code, the computation of the coordinates in the new system happens in Steps 5 and 6 (selecting the eigenvector columns, then doing the dot product).
And note that we did it twice above (once without Step 4 and once with it), and that it resulted in reordering the columns of our projected data (here, since we are in dimension 2, just swapping them), not in reordering the rows, that is, the data points themselves.

Sorting the rows has absolutely no meaning. There never was any particular order in those rows. And if your application does have one (that is, if you have a reason other than chance to consider that [1.4, 15] should be the first of our 4 points), then there is no reason why that order should change because we changed our coordinate system, no more than deciding to call y what we called x and vice versa should change the order of the points.

Your "Step-New" nevertheless does that: it changes, for no reason, the order of the points themselves.

Note that the columns (the coordinates of the points in the new system) should change if you change the order of the vectors of the new basis. But that happens naturally, simply because you performed Step-4 before Step-5 (in other words, the projection onto the new basis is done only after you have chosen the order of the new basis vectors).

And to come back to my first remark: in my example, this Step-NEW would not crash, because sorted_index is [1, 0] here and there are at least 2 rows. But X_meaned[sorted_index] then keeps only rows 1 and 0 (in that order) and silently drops the other two points, so Step-6 returns 2 reduced points instead of 4, which certainly does harm (for example, if you wanted to plot the points with predefined colors to do some gating, half of them would simply be gone).
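You can see it directly:

import numpy as np

X_meaned = np.array([[1.4, 15], [2.1, 20], [-3, -31], [-0.5, -4]])
sorted_index = np.array([1, 0])  # what Step-4 produces for this 2D example

print(X_meaned[sorted_index])
# [[ 2.1 20. ]
#  [ 1.4 15. ]]   -> only 2 rows left; (-3, -31) and (-0.5, -4) are silently dropped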

And if I had had 4 points in a 6-dimensional space, your Step-NEW would have resulted in an IndexError, because sorted_index would then contain the indices 4 and 5, and the rows X_meaned[4] and X_meaned[5] do not exist.

Sorry for the very long explanation; I got carried away. But I felt that the real problem here was understanding what you are doing when you do a PCA.
