How exactly does Standard Scaling a Sparse Matrix work?
Question
I am currently reading "Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow" and came across a tip stating: "If you want to scale a sparse matrix without converting it to a dense matrix first, you can use a StandardScaler with its with_mean hyperparameter set to False: it will only divide the data by the standard deviation, without subtracting the mean (as this would break sparsity)." So I tried it out to understand what it is doing. However, the result does not seem to be scaled at all. I created a csr_matrix from a np.array and used a StandardScaler with with_mean=False as a parameter. After that, I called fit_transform on the matrix. The non-zero results are all the same and nothing looks scaled. I don't even understand how the results are calculated. I thought the mean is treated as zero and every non-zero value is scaled by the standard deviation of its corresponding column, but that method would have given me the scaled value 1.732, which is not the same as the output (see the sketch after the output below for where that number comes from).
Here is the code sample:
from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix
import numpy as np
X = csr_matrix(np.array([[0, 0, 1], [0, 2, 0], [3, 0, 0]]))
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X)
print(X_scaled)
print(X_scaled.toarray())
This outputs:
(0, 2) 2.1213203435596424
(1, 1) 2.1213203435596424
(2, 0) 2.1213203435596424
[[0. 0. 2.12132034]
[0. 2.12132034 0. ]
[2.12132034 0. 0. ]]
Am I doing something wrong or am I misunderstanding something?
I'm not sure if this is what I expected.
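For reference, here is a minimal sketch of my own reasoning (not from the book) showing where my expected value of 1.732 comes from: if the scaler divided each non-zero value by a standard deviation computed about a mean of zero (i.e. the root mean square of its column), every entry would come out as sqrt(3) ≈ 1.732 instead of 2.12132:
import numpy as np

for i in range(1, 4):
    # "standard deviation" taken about zero, i.e. the RMS of the column [0, 0, i]
    s_about_zero = np.sqrt(np.mean(np.square([0, 0, i])))
    print(f"{i} / {s_about_zero:0.5f} = {i / s_about_zero:0.5f}")
>>> 1 / 0.57735 = 1.73205
>>> 2 / 1.15470 = 1.73205
>>> 3 / 1.73205 = 1.73205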
Answer 1
Score: 0
With with_mean=False, StandardScaler is only dividing each column by its standard deviation. As you can see below, for any number i, this will return 2.12132...:
import numpy as np

for i in range(1, 4):
    s = np.std([0, 0, i])  # std of the full column [0, 0, i], zeros included
    print(f"{i} / {s:0.5f} = {i/s:0.5f}")
>>> 1 / 0.47140 = 2.12132
>>> 2 / 0.94281 = 2.12132
>>> 3 / 1.41421 = 2.12132
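For completeness, here is a quick check of my own (using the same matrix as in the question): the per-column divisors are stored in the fitted scaler's scale_ attribute and match np.std over the full columns, zeros included:
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler

X = csr_matrix(np.array([[0, 0, 1], [0, 2, 0], [3, 0, 0]]))
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X)

# scale_ holds the per-column standard deviations used as divisors
print(scaler.scale_)                # [1.41421356 0.94280904 0.47140452]
print(np.std(X.toarray(), axis=0))  # the same values
print(X.toarray() / scaler.scale_)  # reproduces the 2.12132... entries
Subtracting the column means as well (with_mean=True) would turn every zero into a non-zero value, which is exactly the "breaking sparsity" the book's tip warns about, so only the division by the standard deviation is applied.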