How to deal with very imbalanced classes when doing NLP classification?
Question
I'm working on an NLP classification problem and I noticed that there is a huge disparity between classes.
I'm working with a dataset of ~44k observations and 99 labels. Out of those 99 labels, only 21 have more than 500 observations, and some have as few as 2. Here is a look at the top 21 labels:
What do you suggest I should do? Should I just remove labels that fall below a certain threshold? I looked into data augmentation techniques, but I couldn't find clear documentation on how to apply them to French text.
If you need me to provide more details, please let me know!
EDIT: I created a category called "autre" (meaning "other" in English) into which I put all of the under-represented categories (fewer than 300 occurrences). The data distribution now looks like this:
I then wrote this code to oversample the under-represented categories:
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Calculate the occurrences of each category and store them in the 'Occurrences' column
training['Occurrences'] = training['Domaine sou domaine '].map(training['Domaine sou domaine '].value_counts())

# Set the desired ratio for each category (e.g., 0.5 means each category will end up
# with at least 50% of the maximum class count)
desired_ratio = 0.5

# Resample every category with replacement to reduce the degree of imbalance:
# small classes are oversampled, large ones are undersampled
balanced_data = pd.DataFrame()
for label in training['Domaine sou domaine '].unique():
    max_occurrences = training['Occurrences'].max()
    desired_occurrences = int(max_occurrences * desired_ratio)
    # replace=True samples with replacement, so rows can be duplicated
    samples = training[training['Domaine sou domaine '] == label].sample(n=desired_occurrences, replace=True)
    balanced_data = pd.concat([balanced_data, samples])

# Selecting the specified columns as features
cols = ["Programme de formation", "Description du champ supplémentaire : Objectifs de la formation", "Intitulé (Ce champ doit respecter la nomenclature suivante : Code action – Libellé)_y"]
X = balanced_data[cols]

# Extracting the labels
y = balanced_data['Domaine sou domaine ']

# Splitting the data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
The problem is that the model became TOO GOOD. I was getting at best 58% accuracy before, but now it reaches 85% at the point where val_loss is at its minimum.
QUESTION: Is the model overfitting on the small classes? For example, take the least frequent category, with 312 observations. According to the formula `desired_occurrences = int(max_occurrences * desired_ratio)`, those observations are randomly repeated almost 10 times.
If the model is indeed overfitting and I shouldn't take the 85% accuracy seriously, what should I do next?
Answer 1
Score: 4
**85% because of duplicates in the train and test set**
The 85% can be explained by the fact that your train_test_split is executed after you rebalance your dataset. While rebalancing, certain examples are duplicated and can therefore end up in both your training set and your test set after the split. Avoid this by splitting first and then rebalancing only your training set. Note that you are using `sample(replace=True)`.
**Remove small categories**
It always depends on the use case, but I often obtain better results by removing highly under-represented categories instead of creating an 'others' category. At inference time, the confidence levels for predictions of such categories are likely (hopefully) to be lower. If you then set a minimum threshold, no prediction is made for these cases. Of course, this only works if you can build a fallback process, for example as sketched below.
**Check if the model is overfitting**
If you have a representative test set and no samples from the training set are repeated in it, you can assume that the performance is legitimate.
**Data augmentation**
A data augmentation step that works quite well for text is machine translation: translate the samples of your under-represented classes into one or more other languages and back, e.g. FR -> ES -> EN -> FR. Using more exotic languages will result in more diverse new samples.
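As a sketch of how a FR -> EN -> FR round trip could look with Hugging Face MarianMT checkpoints (the model names and the `transformers` pipeline usage are assumptions to verify, not something from the answer):

```python
from transformers import pipeline

# Two translation pipelines for the round trip
fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

def back_translate(texts):
    """Translate FR texts to EN and back to FR to obtain paraphrased variants."""
    english = [t["translation_text"] for t in fr_to_en(texts)]
    return [t["translation_text"] for t in en_to_fr(english)]

augmented = back_translate(["Formation en gestion de projet agile."])
print(augmented)
```

Inserting an intermediate language (e.g. ES) between the two steps tends to produce more varied paraphrases, at the cost of more translation noise.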
Answer 2
Score: 1
Your initial thoughts on tackling the imbalance problem are on the right track, I believe. In particular,
> What do you suggest I should do? Should I just remove labels that fall below a certain threshold?
This can certainly be an option if it is applicable; it depends on the use case of the resulting model. Even if every label eventually has to be included, it can serve as an initial experiment that gives you insight into model performance and possible improvements. So there is no harm in trying it.
> I looked into data augmentation techniques, but I couldn't find clear documentation on how to apply them to French text.
Augmentation of textual data is a bit blurry and relatively difficult (compared to CV/audio), but techniques do exist. Although most of them focus on English, you can often transfer the idea to another language, provided the method is not tied to a specific language (e.g., hardcoded English vocabularies and the like). You can try a few of the techniques outlined here; if I were you, I would start with back translation (it is not language specific, and you can do it with many languages using current models/APIs).
However, I think there are more options now, after the rise of instruction-following models (e.g., ChatGPT). You can try to prompt-engineer your way into generating additional instances for your categories.
A third way could be to use an algorithm-based technique such as a loss function that incorporates the class imbalance (e.g., focal loss, or other class-balancing/re-weighting losses). These loss functions are mostly used by the vision community, but there is no reason not to use them in NLP.
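As an illustration, here is one common multi-class formulation of focal loss in PyTorch (a hedged sketch, not tied to any particular model in the question; `alpha` can hold per-class weights such as inverse class frequencies):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Cross-entropy that down-weights easy, well-classified examples."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Probability the model assigns to the true class of each sample
    pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    # Standard (optionally class-weighted) cross-entropy per sample
    ce = F.nll_loss(log_probs, targets, weight=alpha, reduction="none")
    return ((1.0 - pt) ** gamma * ce).mean()
```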
My advice from here on is to review the literature on imbalanced learning, as there are loads of studies out there. The literature generally divides the methods into four groups, but at a higher level you can think of three: data-based techniques, algorithm-based techniques, and hybrid methods. A recent survey paper would be the first place to look, both for current SOTA methods and for a view of how the methods evolved historically. For example, you can start by reading this paper and backtrack through the literature as needed.
Answer 3
Score: 0
I would suggest doing two things:
- When splitting your data into training and testing sets, it is better to do so before applying undersampling or oversampling techniques. I suggest this because:
  a) You will measure performance on your test data in a more realistic situation. With new data you are likely to encounter imbalanced classes, and it is better to know how your model performs under those real circumstances.
  b) Your training-set data will not leak into the test set (I think that is the primary reason why you are seeing an improvement in your metrics, though I could be mistaken). You can use `train_test_split(..., stratify=training['Domaine_sous_domaine'])` to split the data in a stratified manner.
- I see that you are using undersampling techniques, but I suggest trying oversampling, because more data tends to lead to a better model. You could use `imblearn.over_sampling.RandomOverSampler` instead of the manual calculations; see the sketch after this list.
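A possible sketch combining both points, assuming `X` and `y` are the feature DataFrame and labels built in the question:

```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

# Split first (stratified), so the test set keeps the original class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Oversample only the training data; RandomOverSampler duplicates minority rows
ros = RandomOverSampler(random_state=42)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)
```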
Another way to balance your classes is to use the parameter in your classification model that is responsible for this. For instance, in LogisticRegression and RandomForestClassifier you can set `class_weight='balanced'`.
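For example, a minimal illustration that assumes a TF-IDF feature step over one of the question's text columns (the pipeline itself is not part of the original answer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 'balanced' reweights each class inversely to its frequency in the training labels
clf = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight='balanced', max_iter=1000),
)
clf.fit(X_train["Programme de formation"], y_train)
```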
I hope this helps, and good luck!