Oob_score related questions, from error messages to how to actually use it

Question

This is going to be a long post. I have so many questions. I am using a data set of shape (15000, 5), plus a dependent variable with 3 classes (5000 data points per class). I use StratifiedShuffleSplit to divide the data into a train set (12000,) and a test set (3000,). I run a random forest classifier through RandomizedSearchCV.

Error message with oob_score

I run the random forest with oob_score=True, but I get the warning: "UserWarning: Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable OOB estimates." I have increased n_estimators, but nothing works. Do I just ignore it, or is there something I can do?

# Use the random grid to search for best hyperparameters
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_grid = {'n_estimators': range(1, 1000, 1),
              'min_samples_split': [2, 5, 10],
              'max_depth': range(1, 500, 5),
              'min_samples_leaf': [1, 2, 4],
              'bootstrap': [True]}
RF = RandomForestClassifier(oob_score=True,
                            random_state=42,
                            warm_start=True,
                            n_jobs=-1)

# Random search of parameters, using 3-fold cross validation,
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator=RF, param_distributions=param_grid,
                               n_iter=100, cv=3, verbose=2)
# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.best_params_
rf_random.best_score_, rf_random.best_estimator_.oob_score_
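The warning can be reproduced in isolation. Below is a minimal sketch, using synthetic make_classification data as a stand-in for the real X_train/y_train (the data here is an assumption, not the questioner's): with only a couple of trees, some training samples are in-bag for every tree and therefore never receive an OOB prediction, which triggers the UserWarning; with enough trees the warning goes away.

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the original data: 3 balanced classes, 5 features
X, y = make_classification(n_samples=1500, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=42)

# With only 2 trees, many samples are in-bag for every tree, so they
# never receive an OOB prediction -> sklearn emits the UserWarning
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    RandomForestClassifier(n_estimators=2, oob_score=True,
                           random_state=42).fit(X, y)
few_tree_warnings = [w for w in caught if "OOB" in str(w.message)]

# With 200 trees, every sample is out-of-bag for some trees -> no warning
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=42).fit(X, y)
many_tree_warnings = [w for w in caught if "OOB" in str(w.message)]

print(len(few_tree_warnings) > 0)    # True: the warning appears with 2 trees
print(len(many_tree_warnings) == 0)  # True: it disappears with 200 trees
```

In other words, the warning is not something to silence: it means individual sampled values of n_estimators were too small, so the fix is to keep the low end of the search range well above the single digits.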

Interpreting and using oob_score

I get a best_score_ of 0.89 and a best_estimator_.oob_score_ of 0.90. Is the oob_score_ 90% correct classification, or is it (1 - best_estimator_.oob_score_), which would mean the model is really bad?

Also, since I am comparing this model to other ML algorithms, I am guessing that accuracy is the best way to stay consistent, right?

Can I use the oob_score to test for overfitting and underfitting?

Answer 1

Score: 1

  • We use random forests to get more robust models compared to using just a single decision tree. When using a range of (1,1000), it is possible to get just a few trees in the forest, which is contrary to the idea of using random forests. You might consider adjusting the range to a range with a higher lower bound, e.g. range(50, 1000, 1)
  • A higher number of trees in the forest might indeed resolve the UserWarning
  • 0.9 is the fraction of correctly classified out-of-bag samples (not 1 - 0.9), which seems to be quite good.
  • The relatively small difference between oob-score and cv-score indicates your model is most likely not overfitting
  • Regarding the metric, please see Comprehensive Guide to Multiclass Classification Metrics for metrics which are more suitable than accuracy for multiclass classification
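The answer's advice can be sketched end to end. This is a minimal example under assumptions (synthetic make_classification data in place of the questioner's, a small n_iter for speed): the n_estimators lower bound is raised to 50, and the OOB score is compared against the cross-validated score as the rough overfitting check described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the original training data
X, y = make_classification(n_samples=1500, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=42)

# Lower bound raised to 50 so the search never picks a near-empty forest
param_grid = {'n_estimators': range(50, 201, 50),
              'min_samples_split': [2, 5, 10],
              'max_depth': [None, 10, 50],
              'min_samples_leaf': [1, 2, 4]}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(oob_score=True, random_state=42,
                                     n_jobs=-1),
    param_distributions=param_grid,
    n_iter=10, cv=3, random_state=42)
search.fit(X, y)

cv_score = search.best_score_                  # mean 3-fold CV accuracy
oob_score = search.best_estimator_.oob_score_  # OOB accuracy (fraction correct)

# A small gap between the two scores is a rough sign the model
# is not badly overfitting
print(round(cv_score, 2), round(oob_score, 2))
```

Both numbers are accuracies on data the trees did not train on, which is why a large gap between them (or between either of them and the training accuracy) would be a red flag.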

huangapple
  • Posted on 2023-04-17 15:32:26
  • Please keep this link when reposting: https://go.coder-hub.com/76032674.html