MinMaxScaler doesn't scale small values to 1
Question
I found some weird behavior in sklearn.preprocessing.MinMaxScaler, and the same in sklearn.preprocessing.RobustScaler.
When the data's maximum value is very small (on the order of 10^(-16)), the transformer does not change the maximum: the output keeps the raw data's maximum value. Why? df_small.dtypes is float64, and that type can represent much smaller numbers. How can I fix this without the handcrafted data = data / data.max()?
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df_small = pd.DataFrame(np.arange(5) * 10.0**(-16))
scaler_small = MinMaxScaler()
small_transformed = scaler_small.fit_transform(df_small)
print(small_transformed)
[[0.e+00]
[1.e-16]
[2.e-16]
[3.e-16]
[4.e-16]]
df_not_small = pd.DataFrame(np.arange(5) * 10.0**(-15))
scaler_not_small = MinMaxScaler()
not_small_transformed = scaler_not_small.fit_transform(df_not_small)
print(not_small_transformed)
[[0. ]
[0.25]
[0.5 ]
[0.75]
[1. ]]
Answer 1
Score: 1
When applying the scaling, MinMaxScaler calls the _handle_zeros_in_scale() function, which has the check:
constant_mask = scale < 10 * np.finfo(scale.dtype).eps
For a dtype of np.float64, the value of 10 * np.finfo(scale.dtype).eps is 2.220446049250313e-15, which is larger than your range of 4e-16 in the first case (but smaller than the range of 4e-15 in the second case). If the scale is smaller than this, it sets the scale factor to 1 (see this line):
scale[constant_mask] = 1.0
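You can check that cutoff directly (a quick sketch; the threshold variable name is mine):
import numpy as np

threshold = 10 * np.finfo(np.float64).eps
print(threshold)           # 2.220446049250313e-15
print(4e-16 < threshold)   # True:  the first case is flagged as constant, so its scale is forced to 1
print(4e-15 < threshold)   # False: the second case is scaled normally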
Unfortunately, you'll either have to scale the data manually yourself, or edit scikit-learn to allow samples with a smaller overall range.
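For instance, a manual pre-scaling step along the lines of the data / data.max() idea from the question (a sketch, not a built-in scikit-learn option):
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df_small = pd.DataFrame(np.arange(5) * 10.0**(-16))
# Divide by the column's absolute maximum so the data range is no longer
# below 10 * eps; MinMaxScaler then scales to [0, 1] as expected.
rescaled = df_small / df_small.abs().max()
print(MinMaxScaler().fit_transform(rescaled))
# [[0.  ]
#  [0.25]
#  [0.5 ]
#  [0.75]
#  [1.  ]]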