Do we need to exclude OneHotEncoded columns while standardizing or normalizing using MinMaxScaler() or StandardScaler()?

Question

This is the final cleaned DataFrame (df2) before standardizing.

My code:

from sklearn.preprocessing import StandardScaler

# Fit the scaler on every column of df2 and overwrite the values in place
scaler = StandardScaler()
df2[list(df2.columns)] = scaler.fit_transform(df2[list(df2.columns)])
df2

This returns a DataFrame in which every column has been standardized, including the dummy and categorical columns. Is this the correct approach, or should only the numerical columns be specified when standardizing?

Answer 1

Score: 0

It doesn't really matter for MinMaxScaler: with the default feature range, a column that contains only 0s and 1s (and at least one of each) maps to itself, so the transform is an identity. StandardScaler, on the other hand, is the interesting one. If you apply it to a one-hot encoded column, the 0/1 codes are replaced by two values that depend on the share of samples in that category: for a category that covers a fraction p of the rows, 0 becomes -sqrt(p / (1 - p)) and 1 becomes sqrt((1 - p) / p), so rarer categories get a larger positive code.

This boils down to an empirical question of what works for your application, as both paths can be justified. Simply standardizing everything is a more "unified" way and so a simpler approach overall, but in the end ML is an empirical field. Do what gives you the best results.
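To make the two behaviours concrete, here is a minimal sketch on a made-up DataFrame. The column names age and is_red and the toy values are assumptions for illustration and are not taken from the question's df2. It checks what each scaler does to a 0/1 column, then shows one possible way to standardize only the numeric columns with ColumnTransformer while passing the dummies through unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: one numeric column and one one-hot (dummy) column.
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0, 50.0],
    "is_red": [1.0, 0.0, 0.0, 0.0],  # category present in 1 of 4 rows, p = 0.25
})

# MinMaxScaler (default range 0..1) leaves a column holding both 0 and 1 unchanged.
minmax_dummy = MinMaxScaler().fit_transform(df[["is_red"]])

# StandardScaler replaces 0/1 with values that depend on the category share p:
# 0 -> -sqrt(p / (1 - p)), 1 -> sqrt((1 - p) / p).
std_dummy = StandardScaler().fit_transform(df[["is_red"]])

# Alternative: scale only the numeric columns and pass the dummies through.
numeric_cols = ["age"]  # assumed list of numeric columns
preprocessor = ColumnTransformer(
    [("num", StandardScaler(), numeric_cols)],
    remainder="passthrough",
)
# Output columns: standardized "age" first, then the untouched "is_red".
scaled = preprocessor.fit_transform(df)
```

Whether the passthrough variant or the scale-everything variant works better is something to compare on your own model and data, in line with the answer above.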

huangapple
  • Posted on April 7, 2023, 02:42:15
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75952767.html