Feature Engineering on imbalanced data
Question
I am training a machine learning model on a classification problem. My dataset has 10000 observations across 37 classes. But the data is imbalanced: some classes have only 100 observations, while others have 3000 or 4000.
After searching for ways to do feature engineering on this type of data to improve the algorithm's performance, I found two solutions (sketched in code below):
- upsampling, which means getting more data for the minority classes
- downsampling, which means removing data from the majority classes
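For concreteness, here is a minimal sketch of both operations using sklearn.utils.resample (toy arrays, not my actual data):

```python
import numpy as np
from sklearn.utils import resample

# Toy arrays: class A is a minority (100 rows), class B a majority (3000 rows).
X_a = np.random.rand(100, 5)
X_b = np.random.rand(3000, 5)

# Upsampling: draw minority rows with replacement until they match the majority.
X_a_up = resample(X_a, replace=True, n_samples=len(X_b), random_state=0)

# Downsampling: keep a random subset of the majority class.
X_b_down = resample(X_b, replace=False, n_samples=len(X_a), random_state=0)
```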
Regarding the first solution:
I have many classes with only a few observations, so it would require much more data and a long time. That would be hard for me!
And by applying the second one:
I think all classes would end up with only a few observations, and the dataset would be so small that it would be hard for the algorithm to generalize.
So, is there another solution I can try for this problem?
Answer 1
Score: 1
You can change the weights in your loss function so that the smaller classes have larger importance when optimizing. In Keras you can use weighted_cross_entropy_with_logits, for example.
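The answer names weighted_cross_entropy_with_logits, which in TensorFlow is a binary/multi-label loss taking a pos_weight argument. For a 37-class problem, the same idea, scaling the loss per class, is also available through the class_weight argument of model.fit. A minimal sketch, assuming TensorFlow 2.x, with placeholder data and a toy architecture:

```python
import numpy as np
import tensorflow as tf

num_classes = 37  # matches the question
# Placeholder data; 20 features is an assumption, not from the question.
X = np.random.rand(10000, 20).astype("float32")
y = np.random.randint(0, num_classes, size=10000)

# Inverse-frequency weights: the rarer the class, the larger its weight.
counts = np.bincount(y, minlength=num_classes)
class_weight = {i: len(y) / (num_classes * max(c, 1)) for i, c in enumerate(counts)}

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# class_weight multiplies each sample's loss by the weight of its class,
# so minority classes contribute more to the gradient.
model.fit(X, y, epochs=5, batch_size=32, class_weight=class_weight)
```

The inverse-frequency heuristic above is the same thing sklearn.utils.class_weight.compute_class_weight("balanced", ...) computes; any weighting that boosts the rare classes works on the same principle.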
Answer 2
Score: 0
You could use a combination of both.
It sounds like you are worried about getting a dataset that is too large if you upsample all minority classes to match the majority classes. If this is the case, you can downsample the majority classes to something like 25% or 50%, and at the same time upsample the minority classes. An alternative to upsampling is synthesising samples for the minority classes using an algorithm like SMOTE.
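A minimal sketch of that combination with the imbalanced-learn package (imblearn); the dataset and the resampling targets below are illustrative assumptions, not values from the question:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Toy imbalanced dataset: a couple of large classes, several small ones.
X, y = make_classification(
    n_samples=10000, n_features=20, n_informative=10, n_classes=5,
    weights=[0.4, 0.3, 0.15, 0.1, 0.05], random_state=0)
print("before:", sorted(Counter(y).items()))

# Step 1: cap the two largest classes at 2000 samples each.
# Step 2: SMOTE synthesises minority samples until every class matches the largest.
resampler = Pipeline([
    ("under", RandomUnderSampler(sampling_strategy={0: 2000, 1: 2000},
                                 random_state=0)),
    ("over", SMOTE(random_state=0)),
])
X_res, y_res = resampler.fit_resample(X, y)
print("after:", sorted(Counter(y_res).items()))
```

Note that plain SMOTE interpolates between samples, so it assumes numeric features; for datasets with categorical features, imblearn provides SMOTENC.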
If you are training a neural network in mini-batches, it is good to make sure that the training set is properly shuffled and that you have an even-ish distribution of minority/majority samples across the mini-batches.
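One way to get that even distribution is to build one shuffled stream per class and sample from the streams uniformly. A sketch with tf.data, assuming TensorFlow 2.7+ (where Dataset.sample_from_datasets is a stable method) and placeholder arrays:

```python
import numpy as np
import tensorflow as tf

# Placeholder data; shapes and feature count are assumptions.
num_classes = 37
X = np.random.rand(10000, 20).astype("float32")
y = np.random.randint(0, num_classes, size=10000)

# One infinite, shuffled stream per class.
per_class = [
    tf.data.Dataset.from_tensor_slices((X[y == c], y[y == c]))
    .shuffle(1024)
    .repeat()
    for c in range(num_classes)
]

# Drawing from the streams with equal probability makes each mini-batch
# approximately class-balanced, whatever the raw class frequencies are.
balanced = tf.data.Dataset.sample_from_datasets(
    per_class, weights=[1.0 / num_classes] * num_classes)
batches = balanced.batch(32).prefetch(tf.data.AUTOTUNE)
```

Uniform weights give fully balanced batches; if that distorts the data distribution too much, you can soften the weights toward the true class frequencies.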