How to measure performance on a highly unbalanced dataset?

Question

I am struggling to find the best way to measure the performance of my model on a highly unbalanced dataset. My dataset is for a binary classification problem: predicting stroke cases. It contains 3364 negative cases and 202 positive cases.

The F1-score would be the most important metric in this context, correct? But it always comes out extremely low. I'm also calculating the area under the ROC curve, but I'm not sure whether it is useful in this case. Note that when balancing the data, I balance only the training set and leave the test set intact.

Here's the code:

Splitting the training and test data:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_base, y_base)

Function that receives the resampled training set and prints the metrics:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

def reportSample(x_resampled, y_resampled, name):
    # Train on the resampled training data, then evaluate on the untouched
    # test set (x_test and y_test are globals created by the split above).
    print(name)
    rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_classifier.fit(x_resampled, y_resampled)
    previsoes = rf_classifier.predict(x_test)  # hard class predictions
    report = classification_report(y_test, previsoes)
    probabilidades = rf_classifier.predict_proba(x_test)[:, 1]  # P(class 1)
    auc = roc_auc_score(y_test, probabilidades)
    print(report)
    print("AUC = ", auc)

RandomOverSampler:

from imblearn.over_sampling import RandomOverSampler
over_sampler = RandomOverSampler(sampling_strategy=0.5)
x_resampled, y_resampled = over_sampler.fit_resample(x_train, y_train)
reportSample(x_resampled,y_resampled,"Random over sampler")

NearMiss:

from imblearn.under_sampling import NearMiss
nearmiss = NearMiss(version=2,sampling_strategy='majority')
x_resampled, y_resampled = nearmiss.fit_resample(x_train, y_train)
reportSample(x_resampled,y_resampled,"NearMiss underSample")

SMOTE:

from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
x_resampled,y_resampled = sm.fit_resample(x_train,y_train)
reportSample(x_resampled,y_resampled,"Smote over sampling")

Classification reports of all 3 methods:

[Nearmiss cr](https://i.stack.imgur.com/6M8FL.png)
[RandomCr](https://i.stack.imgur.com/yvZB8.png)
[SmoteCr](https://i.stack.imgur.com/lIDHz.png)

Answer 1

Score: 2

It's very difficult for someone to give you a correct answer to this question, since it depends on your specific needs. Ultimately, the answer will involve the following:

  1. Figure out what you actually want your model to do. Do you care more about correct predictions for one of the classes? Do you care about minimising false positives? And so on.

  2. Learn what information each metric actually provides. If you aren't sure whether a metric you're using is worth using in this scenario, you probably don't understand it well enough yet, so read up on what it does.

  3. Use a variety of metrics in combination. Each metric tells you something different and you'll likely end up balancing competing metrics.
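To make the points above concrete, here is a minimal sketch of what a few common metrics report on a class ratio like the one in the question (3364 negatives, 202 positives). The helper `metrics_from_counts` and the confusion-matrix counts of the second model are hypothetical, chosen only for illustration; they are not results from the asker's models. Returning 0.0 when a denominator is zero is a convention assumed here.

```python
def metrics_from_counts(tp, fp, fn, tn, beta=1.0):
    """Compute binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision and recall are undefined when their denominators are zero;
    # this sketch reports 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-beta: beta > 1 weights recall more heavily, beta < 1 favours precision.
    b2 = beta ** 2
    denom = b2 * precision + recall
    fbeta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "fbeta": fbeta,
    }


# Class counts taken from the question.
NEG, POS = 3364, 202

# A model that always predicts "no stroke": high accuracy, zero recall.
always_negative = metrics_from_counts(tp=0, fp=0, fn=POS, tn=NEG)

# Hypothetical model: finds 120 of the 202 positives but raises 400 false alarms.
hypothetical = metrics_from_counts(tp=120, fp=400, fn=82, tn=2964, beta=2.0)

for name, result in [("always negative", always_negative),
                     ("hypothetical", hypothetical)]:
    print(name, {key: round(value, 3) for key, value in result.items()})
```

The always-negative model scores higher on accuracy than the hypothetical one even though it never detects a single positive case, which is why accuracy alone says little here. Whether F1, F2, or precision at a fixed recall is the right target depends on how costly a missed stroke is compared with a false alarm.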

If you like, you can combine the results of multiple metrics based on some importance criteria you define.
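One simple way to do such a combination is a weighted average, sketched below. The function `combined_score`, the weights, and the metric values for the two models are all hypothetical placeholders for illustration; the weights in particular are something you would have to justify from your own requirements.

```python
def combined_score(metrics, weights):
    """Weighted average of the metrics named in `weights`."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * metrics[name] for name in weights) / total_weight


# Hypothetical metric values for two candidate models (not real results).
model_a = {"recall": 0.60, "precision": 0.25, "roc_auc": 0.80}
model_b = {"recall": 0.30, "precision": 0.50, "roc_auc": 0.82}

# Assumed importance criteria: missing a positive case is considered
# three times as important as the other two metrics.
weights = {"recall": 3.0, "precision": 1.0, "roc_auc": 1.0}

score_a = combined_score(model_a, weights)
score_b = combined_score(model_b, weights)
print("model A:", round(score_a, 3))
print("model B:", round(score_b, 3))
```

With these assumed weights, model A ranks higher despite its lower precision, because recall dominates the score. A different weighting could reverse the ranking, so the weights should be fixed before looking at the results.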

huangapple
  • Posted on May 21, 2023, 22:59:07
  • Original link: https://go.coder-hub.com/76300524.html