March 9, 2023 17:08:42

I have a balanced dataset, but after I split it into train and test sets the test set is imbalanced. What is the reason?

Question

My whole dataset has 8 classes, with 100 subjects in each class.
When I split it for classification, the test set comes out imbalanced.

This is how I split it:

X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, :-1], data.iloc[:, -1], test_size=0.2, random_state=42)

The confusion matrix of the RFC is:
[image: confusion matrix]
For example, there are only 10 samples from the second class. Why does this happen, and should I balance it?

Thank you all.

Answer 1

Score: 2

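The train_test_split function from scikit-learn assigns rows to the train and test sets at random, so the number of samples per class in each set is not guaranteed to be equal.

To keep the class proportions the same in the train and test sets, you need to pass the "stratify" argument. With 100 samples per class and test_size=0.2, each class then contributes 20 samples to the test set.

See the documentation.

X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, :-1],
    data.iloc[:, -1],
    test_size=0.2,
    random_state=42,
    stratify=data.iloc[:, -1])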
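
To see the difference, here is a minimal sketch that compares per-class test counts with and without stratification. The DataFrame is a hypothetical stand-in for the asker's data (8 classes, 100 rows each, five random feature columns, label in the last column). The column name "label" and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the asker's data: 8 classes x 100 rows,
# 5 random feature columns, class label in the last column.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(800, 5)))
data["label"] = np.repeat(np.arange(8), 100)

X, y = data.iloc[:, :-1], data.iloc[:, -1]

# Plain random split: per-class test counts are free to drift away from 20.
_, _, _, y_test_random = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stratified split: class proportions of y are preserved in both sets.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

random_counts = y_test_random.value_counts().sort_index()
strat_counts = y_test_strat.value_counts().sort_index()

print("random split:    ", random_counts.to_dict())
print("stratified split:", strat_counts.to_dict())
```

In the stratified case every class should appear 20 times in the test set, while the random split only hits that number on average. A mildly uneven test set from a random split is usually not a problem in itself, but stratifying makes per-class metrics easier to compare.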
