Oob_score related questions, from error messages to how to actually use it

Question

This is going to be a long post. I have so many questions. I am using a data set of shape (15000, 5), plus a dependent variable with 3 classes (5000 data points per class). I use StratifiedShuffleSplit to divide the data into a train set (12000,) and a test set (3000,). I run a random forest classifier through RandomizedSearchCV.

Error message with oob_score

I run the random forest with oob_score=True, but I get the warning: "UserWarning: Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable OOB estimates." I have increased n_estimators, but nothing works. Do I just ignore it, or is there something I can do?

# Use the random grid to search for best hyperparameters
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_grid = {'n_estimators': range(1, 1000, 1),
              'min_samples_split': [2, 5, 10],
              'max_depth': range(1, 500, 5),
              'min_samples_leaf': [1, 2, 4],
              'bootstrap': [True]}
RF = RandomForestClassifier(oob_score=True,
                            random_state=42,
                            warm_start=True,
                            n_jobs=-1)

# Random search of parameters, using 3-fold cross validation,
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator=RF, param_distributions=param_grid,
                               n_iter=100, cv=3, verbose=2)
# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.best_params_
rf_random.best_score_, rf_random.best_estimator_.oob_score_
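The warning can be reproduced in isolation. Below is a minimal sketch, using synthetic make_classification data as a stand-in for the real X_train/y_train (the data here is an assumption, not the questioner's): with only a couple of trees, some training samples are in-bag for every tree and therefore never receive an OOB prediction, which triggers the UserWarning; with enough trees the warning goes away.

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the original data: 3 balanced classes, 5 features
X, y = make_classification(n_samples=1500, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=42)

# With only 2 trees, many samples are in-bag for every tree, so they
# never receive an OOB prediction -> sklearn emits the UserWarning
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    RandomForestClassifier(n_estimators=2, oob_score=True,
                           random_state=42).fit(X, y)
few_tree_warnings = [w for w in caught if "OOB" in str(w.message)]

# With 200 trees, every sample is out-of-bag for some trees -> no warning
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=42).fit(X, y)
many_tree_warnings = [w for w in caught if "OOB" in str(w.message)]

print(len(few_tree_warnings) > 0)    # True: the warning appears with 2 trees
print(len(many_tree_warnings) == 0)  # True: it disappears with 200 trees
```

In other words, the warning is not something to silence: it means individual sampled values of n_estimators were too small, so the fix is to keep the low end of the search range well above the single digits.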

Interpreting and using oob_score

I get a best_score_ of 0.89 and a best_estimator_.oob_score_ of 0.90. Is the oob_score_ 90% correct classification, or is it (1 - best_estimator_.oob_score_), which would mean the model is really bad?

Also, since I am comparing this model to other ML algorithms, I am guessing that accuracy is the best way to stay consistent, right?

Can I use the oob_score to test for overfitting and underfitting?

Answer 1

Score: 1

  • We use random forests to get more robust models compared to using just a single decision tree. When using a range of (1,1000), it is possible to get just a few trees in the forest, which is contrary to the idea of using random forests. You might consider adjusting the range to a range with a higher lower bound, e.g. range(50, 1000, 1)
  • A higher number of trees in the forest might indeed resolve the UserWarning
  • 0.9 is the fraction of correctly classified out-of-bag samples (not 1 - 0.9), which seems to be quite good.
  • The relatively small difference between oob-score and cv-score indicates your model is most likely not overfitting
  • Regarding the metric, please see Comprehensive Guide to Multiclass Classification Metrics for metrics which are more suitable than accuracy for multiclass classification
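The answer's advice can be sketched end to end. This is a minimal example under assumptions (synthetic make_classification data in place of the questioner's, a small n_iter for speed): the n_estimators lower bound is raised to 50, and the OOB score is compared against the cross-validated score as the rough overfitting check described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the original training data
X, y = make_classification(n_samples=1500, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=42)

# Lower bound raised to 50 so the search never picks a near-empty forest
param_grid = {'n_estimators': range(50, 201, 50),
              'min_samples_split': [2, 5, 10],
              'max_depth': [None, 10, 50],
              'min_samples_leaf': [1, 2, 4]}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(oob_score=True, random_state=42,
                                     n_jobs=-1),
    param_distributions=param_grid,
    n_iter=10, cv=3, random_state=42)
search.fit(X, y)

cv_score = search.best_score_                  # mean 3-fold CV accuracy
oob_score = search.best_estimator_.oob_score_  # OOB accuracy (fraction correct)

# A small gap between the two scores is a rough sign the model
# is not badly overfitting
print(round(cv_score, 2), round(oob_score, 2))
```

Both numbers are accuracies on data the trees did not train on, which is why a large gap between them (or between either of them and the training accuracy) would be a red flag.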

huangapple
  • Posted on 2023-04-17 15:32:26
  • Please keep this link when reposting: https://go.coder-hub.com/76032674.html