Why do I get a much higher score from cross_val_score() than on the actual test set?
Question
I've been using a random forest in sklearn to make predictions on a dataset, and the following code prints the scores:
print(np.mean(cross_val_score(rf, X_train_resampled,
                              y_train_resampled, cv=5, scoring='accuracy')))
print(balanced_accuracy_score(y_valid, predictions))
However, cross_val_score gives an accuracy of 0.93 (obviously much higher than on the actual test), while balanced_accuracy_score gives only 0.40.
I've asked New Bing and searched Stack Overflow but haven't found a good enough answer. Does this happen because the model is not good enough, or have I done something wrong?
Answer 1
Score: 0
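Yes, your model is not good. It was able to cheat because the data is imbalanced.
For example, I created a dataset where 95% of the samples are class 1 and 5% are class 0. If you test a dummy model (one that always returns 1) on this dataset, you get an accuracy of 0.95 but a balanced accuracy of only 0.5:
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# 100 random samples with 10 features; 95 labelled 1 and 5 labelled 0
data = np.random.randn(100, 10)
labels = np.array(95 * [1] + 5 * [0])

class DummyModel:
    """A model that ignores its input and always predicts class 1."""
    def predict(self, x):
        return np.ones(x.shape[0], dtype=int)

dummy_model = DummyModel()
# Plain accuracy rewards always guessing the majority class
print(accuracy_score(labels, dummy_model.predict(data)))
# Balanced accuracy averages per-class recall, so the missed minority class shows up
print(balanced_accuracy_score(labels, dummy_model.predict(data)))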
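The same effect shows up inside cross_val_score itself. The following is a minimal sketch, not the asker's actual setup: it assumes the same synthetic 95/5 dataset as above and swaps the random forest for sklearn's DummyClassifier, then cross-validates once with scoring='accuracy' and once with scoring='balanced_accuracy'.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data (an assumption for illustration): 95 samples of class 1, 5 of class 0
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = np.array(95 * [1] + 5 * [0])

# Baseline that always predicts the most frequent class seen during fit
clf = DummyClassifier(strategy="most_frequent")

# Same model, same folds, two different metrics
acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
bal_acc = cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy").mean()

print("accuracy:", acc)
print("balanced accuracy:", bal_acc)
```

With stratified 5-fold splits each fold holds 19 majority samples and 1 minority sample, so the mean accuracy comes out at 0.95 while the mean balanced accuracy is 0.5. For the code in the question, passing scoring='balanced_accuracy' to cross_val_score would at least make the two printed numbers measure the same thing.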