
MinMaxScaler doesn't scale small values to 1

Question


I found weird behavior in sklearn.preprocessing.MinMaxScaler (and the same in sklearn.preprocessing.RobustScaler):
when the data's maximum value is very small, on the order of 10^(-16), the transformer leaves the maximum unchanged from the raw data. Why? df_small.dtypes is float64, and that type can represent much smaller numbers. How can I fix this without the handcrafted workaround data = data / data.max()?

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df_small = pd.DataFrame(np.arange(5) * 10.0**(-16))
scaler_small = MinMaxScaler()
small_transformed = scaler_small.fit_transform(df_small)
print(small_transformed)

[[0.e+00]
 [1.e-16]
 [2.e-16]
 [3.e-16]
 [4.e-16]]

df_not_small = pd.DataFrame((np.arange(5)*10.0**(-15)))
scaler_not_small = MinMaxScaler()
not_small_transformed = scaler_not_small.fit_transform(df_not_small)
print(not_small_transformed)

[[0.  ]
 [0.25]
 [0.5 ]
 [0.75]
 [1.  ]]

Answer 1

Score: 1


When applying the scaling, MinMaxScaler calls the _handle_zeros_in_scale() function, which has this check:

constant_mask = scale < 10 * np.finfo(scale.dtype).eps

For a dtype of np.float64, the value of 10 * np.finfo(scale.dtype).eps is 2.220446049250313e-15, which is larger than the range of 4e-16 in your first case (but smaller than the range of 4e-15 in the second case). If the scale is smaller than this threshold, it sets the scale factor to 1 (see this line):

scale[constant_mask] = 1.0
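To see the cutoff concretely, the float64 threshold can be computed directly with NumPy (a quick sketch; `_handle_zeros_in_scale` is a private scikit-learn helper, so only the threshold arithmetic is reproduced here):

```python
import numpy as np

# scikit-learn treats any per-feature range below this as "constant"
threshold = 10 * np.finfo(np.float64).eps

# The ranges from the two examples above:
range_small = 4e-16      # first case: scaler leaves data unchanged
range_not_small = 4e-15  # second case: scaler works as expected

print(threshold)                    # ~2.22e-15
print(range_small < threshold)      # True  -> scale factor forced to 1.0
print(range_not_small < threshold)  # False -> normal min-max scaling
```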

Unfortunately, you'll either have to scale the data manually yourself, or edit scikit-learn to allow samples with smaller overall ranges.
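A minimal sketch of the manual workaround mentioned above: plain min-max scaling with pandas/NumPy, which skips scikit-learn's near-zero-range guard entirely (assuming the column is not actually constant, otherwise this divides by zero):

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame(np.arange(5) * 10.0**(-16))

# Plain (x - min) / (max - min) scaling, with no threshold on the range.
col_min = df_small.min()
col_range = df_small.max() - col_min
scaled = (df_small - col_min) / col_range
print(scaled)
```

With the tiny range of 4e-16 this still yields the expected [0, 0.25, 0.5, 0.75, 1] column, since float64 handles the division without trouble.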

huangapple
  • Posted on 2023-07-24 18:56:17
  • Please retain this link when reposting: https://go.coder-hub.com/76753784.html