Feature Engineering on Imbalanced Data

Question

I am training a machine learning model on a classification problem. My dataset has 10,000 observations across 37 classes, but the data is imbalanced: some classes have only 100 observations while others have 3,000 or 4,000.

After searching for ways to handle this type of data and improve the algorithm's performance, I found two solutions (sketched in code just after this list):

  • upsampling, which means getting more data for the minority classes
  • downsampling, which means removing data from the majority classes
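
A minimal sketch of both operations using the imbalanced-learn package; X and y are hypothetical arrays holding the features and the 37-class labels:

```python
# Plain resampling with imbalanced-learn (a sketch, not the only way).
# X, y are assumed: feature matrix and integer labels for the 37 classes.
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Upsampling: duplicate minority-class rows until every class
# matches the largest one.
X_up, y_up = RandomOverSampler(random_state=42).fit_resample(X, y)

# Downsampling: drop majority-class rows until every class
# matches the smallest one.
X_down, y_down = RandomUnderSampler(random_state=42).fit_resample(X, y)
```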

Regarding the first solution:
I have many classes with only a few observations, so collecting more data would take a lot of effort and time. That would be hard for me!
And by applying the second one:
I think all classes would end up with only a few observations, and the dataset would become so small that it would be hard for the algorithm to generalize.

So, is there another solution I can try for this problem?

Answer 1

Score: 1

You can change the weights in your loss function so that the smaller classes have greater importance during optimization. In Keras you can use weighted_cross_entropy_with_logits, for example.
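
A minimal sketch of the class-weighting idea, assuming hypothetical X_train/y_train arrays with integer labels and an already compiled Keras model. It uses Keras's class_weight argument to model.fit, which is a simpler route to the same effect for a multi-class problem:

```python
# Sketch: scale each class's contribution to the loss by inverse frequency.
# Assumes y_train holds integer labels 0..36 and `model` is a compiled
# Keras classifier (both hypothetical here).
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced",
                               classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))  # rare class -> large weight

# The optimizer now pays proportionally more attention to the
# 100-observation classes than to the 3000-4000-observation ones.
model.fit(X_train, y_train, epochs=20, class_weight=class_weight)
```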

Answer 2

Score: 0

You could use a combination of both.

It sounds like you are worried about getting a dataset that is too large if you upsample all minority classes to match the majority classes. If this is the case, you can downsample the majority classes to something like 25% or 50% of their original size and, at the same time, upsample the minority classes. An alternative to plain upsampling is synthesising samples for the minority classes with an algorithm like SMOTE.
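
A sketch of that combination with imbalanced-learn; X and y are as above, and the 50% target is an arbitrary choice. Note that SMOTE interpolates in feature space, so it assumes numeric (or suitably encoded) features:

```python
# Sketch: downsample classes above ~50% of the largest class, then
# use SMOTE to synthesise minority samples up to the same target.
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

counts = Counter(y)
target = max(counts.values()) // 2  # e.g. 4000 -> 2000

# Shrink only the classes that currently exceed the target...
under = RandomUnderSampler(
    sampling_strategy={c: target for c, n in counts.items() if n > target},
    random_state=42,
)
X_mid, y_mid = under.fit_resample(X, y)

# ...then synthesise new samples for the classes below the target.
over = SMOTE(
    sampling_strategy={c: target for c, n in counts.items() if n < target},
    random_state=42,
)
X_res, y_res = over.fit_resample(X_mid, y_mid)
```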

If you are training a neural network in batches, it is good to make sure that the training set is properly shuffled and that minority/majority samples are roughly evenly distributed across the mini-batches.
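
One way to get shuffled, class-balanced mini-batches, sketched with imbalanced-learn's Keras helper; whether imblearn.keras is available depends on your installed imbalanced-learn and Keras versions:

```python
# Sketch: generate shuffled mini-batches with an even class mix.
# Assumes X_res, y_res from the resampling step above and a compiled
# Keras `model` (hypothetical here).
from imblearn.keras import BalancedBatchGenerator
from imblearn.under_sampling import RandomUnderSampler

batches = BalancedBatchGenerator(X_res, y_res,
                                 sampler=RandomUnderSampler(),
                                 batch_size=64,
                                 random_state=42)
model.fit(batches, epochs=20)
```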
