I have a balanced dataset, but after I split it into train and test sets, the test set is imbalanced. What is the reason?
Question
My whole dataset has 8 classes, with 100 subjects in each class.
When I split it for classification, the test set comes out imbalanced.
I split it like this:
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, :-1], data.iloc[:, -1], test_size=0.2, random_state=42)
The confusion matrix of the RFC is:
[Image: confusion matrix]
For example, there are only 10 samples from the second class. Why does this happen, and should I balance it?
Thank you all.
Answer 1
Score: 2
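The train_test_split function from scikit-learn assigns samples to the train and test sets at random, without looking at the class labels, so the per-class counts in each set can drift away from the original balance.
To keep the class proportions the same in the train and test sets, pass the labels through the stratify argument: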
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, :-1],
                                                    data.iloc[:, -1],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=data.iloc[:, -1])
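To see the difference, here is a minimal sketch that compares the two splits on a synthetic stand-in for the dataset in the question. The DataFrame contents, the feature names f0 to f3, and the column name label are assumptions made for illustration only; the only things taken from the question are the shape (8 classes, 100 samples each), the label in the last column, test_size=0.2 and random_state=42.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the asker's data: 8 classes x 100 samples,
# with the label stored in the last column (as in data.iloc[:, -1]).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(800, 4)), columns=["f0", "f1", "f2", "f3"])
data["label"] = np.repeat(np.arange(8), 100)

X, y = data.iloc[:, :-1], data.iloc[:, -1]

# Plain random split: the labels are ignored, so per-class test counts vary.
_, _, _, y_test_random = train_test_split(X, y, test_size=0.2, random_state=42)
random_counts = y_test_random.value_counts().sort_index()

# Stratified split: class proportions of y are preserved in both subsets.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
strat_counts = y_test_strat.value_counts().sort_index()

print("random split:    ", random_counts.to_dict())
print("stratified split:", strat_counts.to_dict())
```

With 800 samples and test_size=0.2 the test set holds 160 samples, so the stratified split puts 20 samples of each class in it. In the plain random split the counts scatter around 20, which is ordinary sampling variation and does not mean the original data needs rebalancing.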