Why did I get a much higher score in cross_val_score() than in the actual test?

Question

I've been using a random forest in sklearn to make predictions on a dataset, and the following code shows the output:

print(np.mean(cross_val_score(rf, X_train_resampled,
      y_train_resampled, cv=5, scoring='accuracy')))
print(balanced_accuracy_score(y_valid, predictions))

However, cross_val_score gives an accuracy of 0.93 (which is obviously much higher than the actual test), while balanced_accuracy_score on the validation set gives only 0.40.

I've asked the new Bing and searched Stack Overflow but found no satisfactory answer. Is this something that happens when the model is not good enough, or have I done something wrong?


Answer 1

Score: 0

Yes, your model is not good. It was able to cheat because of the data imbalance.

For example, I created a dataset in which 95% of the samples are class 1 and 5% are class 0. If you test a dummy model (which always returns 1) on this dataset, you get:

import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# 100 samples: 95% class 1, 5% class 0
data = np.random.randn(100, 10)
labels = np.array(95 * [1] + 5 * [0])

class Model:
    """Dummy model that always predicts class 1."""
    def predict(self, x):
        return np.ones(x.shape[0])

dummy_model = Model()
print(accuracy_score(labels, dummy_model.predict(data)))           # 0.95
print(balanced_accuracy_score(labels, dummy_model.predict(data)))  # 0.5
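As a side note, sklearn ships a built-in DummyClassifier that implements this "always predict the majority class" baseline, so the hand-rolled model class above can be replaced with it (same synthetic data as in the answer):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Same setup as above: 95% class 1, 5% class 0
data = np.random.randn(100, 10)
labels = np.array(95 * [1] + 5 * [0])

# strategy='most_frequent' always predicts the majority class
dummy = DummyClassifier(strategy='most_frequent').fit(data, labels)
preds = dummy.predict(data)

print(accuracy_score(labels, preds))           # 0.95
print(balanced_accuracy_score(labels, preds))  # 0.5
```

Plain accuracy rewards the majority-class guess (0.95), while balanced accuracy averages the per-class recalls (1.0 for class 1, 0.0 for class 0) and correctly reports chance-level performance (0.5).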
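The same mismatch can also be surfaced during cross-validation itself by switching the scoring metric: cross_val_score accepts scoring='balanced_accuracy' as well as scoring='accuracy'. A minimal sketch on a hypothetical imbalanced dataset with pure-noise features (not the asker's data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical dataset: features are pure noise, labels are 90% class 1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.array([1] * 180 + [0] * 20)

rf = RandomForestClassifier(n_estimators=50, random_state=0)

# Plain accuracy looks strong because predicting class 1 is usually right;
# balanced accuracy stays near chance level (0.5) on noise features
acc = cross_val_score(rf, X, y, cv=5, scoring='accuracy').mean()
bal = cross_val_score(rf, X, y, cv=5, scoring='balanced_accuracy').mean()
print(acc, bal)
```

Scoring the cross-validation with the same metric used on the validation set makes the two numbers comparable and exposes majority-class cheating before the final test.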

huangapple
  • Posted on 2023-04-19 15:07:03
  • Please keep this link when reposting: https://go.coder-hub.com/76051627.html