May 21, 2023 22:59:07

How to measure performance on a highly unbalanced dataset?

Question

I am struggling to find the best way to measure the performance of my model on a highly unbalanced dataset. My dataset is for a binary classification problem: predicting stroke cases. It contains 3364 negative cases and 202 positive cases.

In this case the F1-score would be the most important metric, correct? But this metric always comes out extremely low. I'm also calculating the ROC AUC, but I'm not sure whether it is useful here. Note that when balancing the data I only balance the training set and leave the test set intact.

Here's the code:

Splitting the training and test data:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_base, y_base)

Function that receives the resampled training set and prints the metrics:

def reportSample(x_resampled,y_resampled,name):
    print(name)
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report, fbeta_score,roc_auc_score
    rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_classifier.fit(x_resampled,y_resampled)
    from sklearn.metrics import accuracy_score
    previsoes = rf_classifier.predict(x_test)
    report = classification_report(y_test, previsoes)
    probabilidades = rf_classifier.predict_proba(x_test)[:, 1]
    auc = roc_auc_score(y_test, probabilidades)
    print(report)
    print("AUC = ", auc)

RandomOverSampler:

from imblearn.over_sampling import RandomOverSampler
over_sampler = RandomOverSampler(sampling_strategy=0.5)
x_resampled, y_resampled = over_sampler.fit_resample(x_train, y_train)
reportSample(x_resampled, y_resampled, "Random over sampler")

NearMiss:

from imblearn.under_sampling import NearMiss
nearmiss = NearMiss(version=2, sampling_strategy='majority')
x_resampled, y_resampled = nearmiss.fit_resample(x_train, y_train)
reportSample(x_resampled, y_resampled, "NearMiss underSample")

Smote:

from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
x_resampled,y_resampled = sm.fit_resample(x_train,y_train)
reportSample(x_resampled, y_resampled, "Smote over sampling")

Classification reports of all 3 methods:

[Nearmiss cr](https://i.stack.imgur.com/6M8FL.png)
[RandomCr](https://i.stack.imgur.com/yvZB8.png)
[SmoteCr](https://i.stack.imgur.com/lIDHz.png)

Answer 1

Score: 2

It's very difficult for someone to give you a correct answer to this question, since it depends on your specific needs. Ultimately, the answer will involve the following:

1. Figure out what you actually want your model to do. Do you care more about correct predictions for one of the classes? Do you care about minimising false positives? And so on.
2. Learn what information each metric actually provides. If you aren't sure whether a metric you're using is worthwhile in this scenario, you probably don't understand it well enough yet, so read up on what it does.
3. Use a variety of metrics in combination. Each metric tells you something different, and you'll likely end up balancing competing metrics.
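As a rough illustration of looking at several metrics side by side, here is a minimal sketch. It does not use the stroke data from the question: the dataset is a synthetic stand-in generated with `make_classification`, with a class ratio chosen to be roughly similar, and the `stratify=y` argument is my own addition rather than something from the original code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    average_precision_score,
    balanced_accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): about 94% negatives.
X, y = make_classification(
    n_samples=3566, n_features=10, weights=[0.94, 0.06], random_state=42
)

# stratify=y keeps the class ratio the same in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)               # hard labels at the 0.5 threshold
y_score = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# Threshold-dependent metrics use y_pred; ranking metrics use y_score.
metrics = {
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
    "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
    "average_precision": average_precision_score(y_test, y_score),
}

for name, value in metrics.items():
    print(f"{name:>18}: {value:.3f}")

# The raw counts often say more than any single summary number.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Precision, recall and F1 depend on the decision threshold, while ROC AUC and average precision only depend on how well the scores rank positives above negatives, so a low F1 next to a decent AUC can simply mean the default 0.5 threshold is a poor fit for the problem.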

If you like, you can combine the results of multiple metrics based on some importance criteria you define.
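One possible way to do that, sketched below under my own assumptions: scikit-learn's `fbeta_score` already combines precision and recall with recall weighted more heavily when `beta > 1`, and a plain weighted average is another option. The helper `weighted_score` and the 0.7/0.3 weights are hypothetical, chosen only for illustration, and the labels are a tiny hand-made example rather than real results.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Tiny hand-made example: 3 positives out of 10 samples.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# F-beta with beta=2 treats recall as more important than precision,
# which may suit a setting where missing a positive case is costly.
f2 = fbeta_score(y_true, y_pred, beta=2)


def weighted_score(scores, weights):
    """Hypothetical helper: weighted average of named metric values."""
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total


# The weights are an arbitrary illustration of "importance criteria".
combined = weighted_score(
    {"precision": precision, "recall": recall},
    {"precision": 0.3, "recall": 0.7},
)

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"F2={f2:.3f} weighted={combined:.3f}")
```

Whatever weighting you pick encodes a judgment about the relative cost of false negatives and false positives, so it is worth deciding that explicitly rather than letting a default metric decide it for you.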
