2023-04-07 02:42:15

Do we need to exclude OneHotEncoded columns while standardizing or normalizing using MinMaxScaler() or StandardScaler()?

Question

This is the final cleaned DataFrame (df2) before standardizing.

My code:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df2[list(df2.columns)] = scaler.fit_transform(df2[list(df2.columns)])
df2

This returns a DataFrame with every column standardized, including the dummy and categorical columns. Is this the correct approach, or should we standardize only the numerical columns?

Answer 1

Score: 0

It doesn't really matter for MinMaxScaler: on a column that contains only 0 and 1, it is the identity. StandardScaler, on the other hand, is an interesting one. If you apply it to a one-hot encoded column, the 0 and 1 are replaced by two values that depend on the fraction p of samples in that category: 0 becomes -sqrt(p / (1 - p)) and 1 becomes sqrt((1 - p) / p), so a rare category gets a large positive code and a common one a small code. Whether to scale the dummies is an empirical question of what works for your application, as both paths can be justified. Simply standardizing everything is a more "unified" way and so a simpler approach overall, but in the end ML is an empirical field. Do what gives you the best results.
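To make the two options concrete, here is a minimal sketch rather than a definitive recipe. The toy DataFrame and the column names `age` and `city_A` are made up for illustration and are not from the question. It checks how each scaler treats a dummy column, then shows one possible way to standardize only the numeric columns, using `ColumnTransformer` with `remainder="passthrough"`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: one numeric column and one one-hot (dummy) column.
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0, 50.0],
    "city_A": [1.0, 0.0, 0.0, 0.0],  # 25% of rows fall in this category
})

# MinMaxScaler leaves a 0/1 column unchanged, because min = 0 and max = 1.
minmax_dummy = MinMaxScaler().fit_transform(df[["city_A"]]).ravel()

# StandardScaler maps 0 and 1 to values that depend on the category frequency p.
std_dummy = StandardScaler().fit_transform(df[["city_A"]]).ravel()
p = df["city_A"].mean()
expected_one = np.sqrt((1 - p) / p)    # value that replaces 1
expected_zero = -np.sqrt(p / (1 - p))  # value that replaces 0

# Alternative: standardize only the numeric columns and pass the dummies through.
numeric_cols = ["age"]
other_cols = [c for c in df.columns if c not in numeric_cols]
ct = ColumnTransformer(
    [("num", StandardScaler(), numeric_cols)],
    remainder="passthrough",
)
# The output holds the transformed columns first, then the passed-through ones.
scaled = pd.DataFrame(
    ct.fit_transform(df),
    columns=numeric_cols + other_cols,
    index=df.index,
)

print("MinMaxScaler on dummy:  ", minmax_dummy)
print("StandardScaler on dummy:", std_dummy)
print(scaled)
```

With p = 0.25 in this toy data, StandardScaler turns the 1 into about 1.73 and each 0 into about -0.58, while the `ColumnTransformer` variant keeps `city_A` as plain 0/1. Which version works better for a given model is still something to check empirically.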
