How exactly does Standard Scaling a Sparse Matrix work?

Question


I am currently reading "Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow" and came across a tip stating: "If you want to scale a sparse matrix without converting it to a dense matrix first, you can use a StandardScaler with its with_mean hyperparameter set to False: it will only divide the data by the standard deviation, without subtracting the mean (as this would break sparsity)." So I tried it out to understand what it is doing. However, the result does not seem to be scaled at all.

I created a csr_matrix from a NumPy array, constructed a StandardScaler with with_mean=False, and called fit_transform on the matrix. The non-zero results are all identical, so nothing appears to be scaled, and I don't even understand how the results are calculated. I thought the mean would be treated as zero and every non-zero value scaled by the standard deviation of its column, but that method would have given me the scaled value 1.732, which does not match the output.

from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix
import numpy as np

# Each column holds exactly one non-zero value: 3, 2, and 1.
X = csr_matrix(np.array([[0, 0, 1], [0, 2, 0], [3, 0, 0]]))
scaler = StandardScaler(with_mean=False)  # divide by std only, to preserve sparsity
X_scaled = scaler.fit_transform(X)
print(X_scaled)            # sparse (row, col) -> value representation
print(X_scaled.toarray())  # dense view of the same matrix


This outputs:

  (0, 2)	2.1213203435596424
  (1, 1)	2.1213203435596424
  (2, 0)	2.1213203435596424

[[0.         0.         2.12132034]
 [0.         2.12132034 0.        ]
 [2.12132034 0.         0.        ]]

Am I doing something wrong or am I misunderstanding something?

I'm not sure if this is what I expected.
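For reference, the 1.732 expected above corresponds to treating each column's mean as zero, i.e. dividing each value by the root-mean-square of its column instead of the true standard deviation. A minimal sketch of that calculation (not what StandardScaler actually does):

```python
import numpy as np

# Each column of X is [0, 0, i] for some non-zero i.
for i in range(1, 4):
    rms = np.sqrt(np.mean(np.square([0, 0, i])))  # "std" if the mean were 0
    print(f"{i} / {rms:0.5f} = {i / rms:0.5f}")   # always sqrt(3) ~ 1.73205
```

Dividing i by sqrt(i**2 / 3) always yields sqrt(3) ≈ 1.732, which is where the expected value comes from.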

Answer 1

Score: 0


With with_mean=False, StandardScaler only divides each column by its standard deviation. The standard deviation is still computed around the column's actual mean; the mean is just not subtracted during the transform. As you can see below, for any column [0, 0, i], dividing i by that column's standard deviation returns 2.12132...

for i in range(1,4):
    s = np.std([0,0,i])
    print(f"{i} / {s:0.5f} = {i/s:0.5f}")

>>> 1 / 0.47140 = 2.12132
>>> 2 / 0.94281 = 2.12132
>>> 3 / 1.41421 = 2.12132
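To check this against scikit-learn itself: the fitted scaler exposes the per-column divisor as its scale_ attribute, and with with_mean=False it equals the ordinary per-column standard deviation of the data. A quick verification (assuming scikit-learn and SciPy are installed):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler

X = csr_matrix(np.array([[0, 0, 1], [0, 2, 0], [3, 0, 0]]))
scaler = StandardScaler(with_mean=False).fit(X)

# scale_ is the per-column divisor; it matches np.std of the dense columns.
print(scaler.scale_)                # [1.41421356 0.94280904 0.47140452]
print(np.std(X.toarray(), axis=0))  # same values
```

Note that np.std uses the biased estimator (ddof=0) by default, which matches how StandardScaler computes the variance.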

huangapple
  • Published on 2023-06-29 17:40:06
  • Please keep this link when reposting: https://go.coder-hub.com/76579867.html