oob_score-related questions: from error messages to how to actually use it

Question
This is going to be a long post. I have so many questions. I am using a data set of shape (15000, 5), plus a dependent variable with 3 classes (5000 data points per class). I use StratifiedShuffleSplit to divide the data into a training set (12000,) and a test set (3000,). I run a random forest classifier through RandomizedSearchCV.
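For reference, a minimal sketch of that split, assuming X is a (15000, 5) NumPy array and y holds the 3 class labels (the variable names are illustrative):

from sklearn.model_selection import StratifiedShuffleSplit

# One stratified split: 12000 training rows, 3000 test rows,
# with all 3 classes represented proportionally in both parts.
sss = StratifiedShuffleSplit(n_splits=1, test_size=3000, random_state=42)
train_idx, test_idx = next(sss.split(X, y))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]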
Error message with oob_score
I run the random forest with oob_score=True, but I get the warning: "UserWarning: Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable OOB estimates." I have increased n_estimators but nothing works. Do I just ignore it, or what can I do?
# Imports assumed by this snippet
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Use the random grid to search for the best hyperparameters
param_grid = {'n_estimators': range(1, 1000, 1),
              'min_samples_split': [2, 5, 10],
              'max_depth': range(1, 500, 5),
              'min_samples_leaf': [1, 2, 4],
              'bootstrap': [True]}
RF = RandomForestClassifier(oob_score=True,
                            random_state=42,
                            warm_start=True,
                            n_jobs=-1)  # the forest itself uses all cores
# Random search over the parameters, using 3-fold cross-validation
# across 100 different combinations
rf_random = RandomizedSearchCV(estimator=RF, param_distributions=param_grid,
                               n_iter=100, cv=3, verbose=2)
# Fit the random search model
rf_random.fit(X_train, y_train)
print(rf_random.best_params_)
print(rf_random.best_score_, rf_random.best_estimator_.oob_score_)
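For what it's worth, the warning is easy to reproduce on synthetic data. The sketch below (illustrative names, and the exact warning behavior can vary with the scikit-learn version) fits one forest with 2 trees and one with 200: with only 2 trees, many samples land in every bootstrap sample and therefore never get an out-of-bag prediction, which is exactly what the warning is about. In the search above, this happens for candidates that draw a small n_estimators from range(1, 1000, 1).

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
# Two trees: many samples appear in both bootstrap samples -> UserWarning
tiny = RandomForestClassifier(n_estimators=2, oob_score=True,
                              random_state=0).fit(X_demo, y_demo)
# Two hundred trees: every sample is out-of-bag for some trees -> no warning
big = RandomForestClassifier(n_estimators=200, oob_score=True,
                             random_state=0).fit(X_demo, y_demo)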
Interpreting and using oob_score

I get a best_score_ of 0.89 and a best_estimator_.oob_score_ of 0.90. Is the oob_score_ 90% correct classification, or is it (1 - best_estimator_.oob_score_), which would mean really bad performance?
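For a classifier, oob_score_ is an accuracy: the fraction of training samples whose out-of-bag majority vote matches the true label, so 0.90 means 90% correct. A minimal sketch that recomputes it by hand, assuming rf is a fitted RandomForestClassifier with oob_score=True:

import numpy as np

# Each row of oob_decision_function_ holds the averaged OOB class votes
# for one training sample; argmax recovers the predicted class index.
oob_pred = rf.classes_[np.argmax(rf.oob_decision_function_, axis=1)]
print(np.mean(oob_pred == y_train))  # should match rf.oob_score_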
Also, since I am comparing this model to other ML algorithms, I am guessing that accuracy is the best way to be consistent, right?
Can I use the oob_score to test for overfitting and underfitting?
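One common heuristic (a sketch of a rule of thumb, not an official API) is to treat oob_score_ as a near-free validation estimate and compare it with the in-sample accuracy, again assuming rf is the fitted best_estimator_ from the search above:

# A large gap between train accuracy and the OOB estimate hints at
# overfitting; both scores being low hints at underfitting.
train_acc = rf.score(X_train, y_train)  # in-sample accuracy
print(f"train: {train_acc:.3f}  OOB: {rf.oob_score_:.3f}")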
Answer 1

Score: 1
- We use random forests to get more robust models compared to using just a single decision tree. When sampling n_estimators from range(1, 1000, 1), it is possible to get just a few trees in the forest, which runs contrary to the idea of using a random forest. Consider a range with a higher lower bound, e.g. range(50, 1000, 1); a sketch applying this appears after the list.
- A higher number of trees in the forest might indeed resolve the UserWarning.
- 0.9 is the fraction classified correctly (not 1 - 0.9), which seems quite good.
- The relatively small difference between the OOB score and the CV score indicates that your model is most likely not overfitting.
- Regarding the metric, see "Comprehensive Guide to Multiclass Classification Metrics" for metrics that are more suitable than accuracy for multiclass classification.
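A sketch combining these suggestions: a higher lower bound on n_estimators, plus macro-averaged F1 as one example of a multiclass-friendly scorer (the specific scoring string is one option among several):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_grid = {'n_estimators': range(50, 1000, 1),  # at least 50 trees
              'min_samples_split': [2, 5, 10],
              'max_depth': range(1, 500, 5),
              'min_samples_leaf': [1, 2, 4],
              'bootstrap': [True]}
rf_random = RandomizedSearchCV(
    estimator=RandomForestClassifier(oob_score=True, random_state=42, n_jobs=-1),
    param_distributions=param_grid, n_iter=100, cv=3,
    scoring='f1_macro',  # macro-F1 weights all 3 classes equally
    verbose=2)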