Why did I get a much higher score from cross_val_score() than in the actual test?

Question

I've been using a random forest in sklearn to make predictions on a data set, and the following code prints the scores:

print(np.mean(cross_val_score(rf, X_train_resampled,
      y_train_resampled, cv=5, scoring='accuracy')))
print(balanced_accuracy_score(y_valid, predictions))

However, cross_val_score reports an accuracy of 0.93, which is far higher than the 0.40 that balanced_accuracy_score reports on the actual test.

I've asked New Bing and searched Stack Overflow but found no satisfactory answer. Does this happen because the model is not good enough, or have I done something wrong?

Answer 1

Score: 0

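Yes, your model is not good. It was able to "cheat" because the data are imbalanced: plain accuracy rewards a model for always predicting the majority class, whereas balanced accuracy averages the recall of each class.

For example, I created a dataset in which 95% of the samples are class 1 and 5% are class 0. If you evaluate a dummy model (one that always returns 1) on this dataset, you get the following: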

import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Synthetic data: 100 samples, 95 labelled 1 and 5 labelled 0
data = np.random.randn(100, 10)
labels = np.array(95 * [1] + 5 * [0])

class Model:
    """Dummy model that ignores its input and always predicts class 1."""

    def predict(self, x):
        return np.ones(x.shape[0], dtype=int)

dummy_model = Model()
predictions = dummy_model.predict(data)
print(accuracy_score(labels, predictions))           # 0.95
print(balanced_accuracy_score(labels, predictions))  # 0.5
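To compare like with like, the same metric can also be requested from cross_val_score through its scoring argument. The following is only a minimal sketch: it assumes sklearn's DummyClassifier as a stand-in for the always-predict-1 model (it is not the asker's random forest) and reuses the synthetic 95/5 labels; the names X, y, clf, acc and bal are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Same synthetic setup as the example above: 95 samples of class 1, 5 of class 0
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = np.array(95 * [1] + 5 * [0])

# Stand-in model (an assumption): always predicts the most frequent class
clf = DummyClassifier(strategy="most_frequent")

# Identical folds and model, scored with two different metrics
acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
bal = cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy").mean()

print(acc)  # about 0.95
print(bal)  # 0.5
```

With the metric held fixed, any remaining gap between the cross-validated score and the validation score comes from the model or the data rather than from the choice of metric.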

huangapple
  • Posted on April 19, 2023, 15:07:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76051627.html