'Invalid number of trees' error with XGBoost and tidymodels

Question

I am trying to predict what a patient's response to a treatment will be. I am an enthusiastic amateur when it comes to machine learning, and I can usually muddle my way through eventually, but this one is beyond me.

I am using R 4.2.2 with tidymodels in RStudio 2023.06.0+421 on an M1 MacBook Air.

My dataset is, unfortunately, very small. I have used it to build logistic regression, Gaussian naive Bayes, and C5.0 decision tree models with varying degrees of success, but I figured XGBoost was at least worth trying. The data consists of 131 observations of various blood tests and ventilator settings; after preprocessing there are 39 variables per observation.

I initially created a resampling object:

xgb_v_fold <- 
  vfold_cv(data = prone_session_1,
           v = 5, 
           repeats = 5, 
           strata = mortality_28)

The recipe for data preprocessing is:

xgb_recipe <- 
  recipe(prone_session_1, formula = mortality_28 ~ .) %>% 
  step_rm(patient_id,
          bmi,
          weight_kg, 
          fi_o2_supine,
          pa_o2_supine) %>% 
  step_dummy(all_factor_predictors(), -mortality_28) %>% 
  step_impute_bag(all_predictors()) %>% 
  step_zv()

I have set the model up for hyperparameter tuning:

xgb_mod <- 
  boost_tree(mode = 'classification',
             engine = 'xgboost',
             mtry = tune(),
             trees = tune(),
             min_n = tune(),
             tree_depth = tune(),
             learn_rate = tune(),
             loss_reduction = tune(),
             sample_size = tune(),
             stop_iter = tune()
             )

For the tuning grid I used a mix of the default settings from the dials package, supplemented with the results of finalize():

xgb_param_fin <- extract_parameter_set_dials(xgb_mod) %>% 
  finalize(juice(xgb_recipe))

xgb_grid <- grid_regular(mtry(range = c(1, 39)),
                         trees(),
                         min_n(),
                         tree_depth(range = c(1, 5)),
                         learn_rate(),
                         loss_reduction(),
                         sample_size(range = c(1, 1)),
                         stop_iter(), 
                         levels = 10
                         )

Combining all of these into a workflow and running the tuning:

xgb_results <- 
  workflow() %>% 
  add_model(xgb_mod) %>% 
  add_recipe(xgb_recipe) %>% 
  tune_grid(resamples = xgb_v_fold,
            grid = xgb_grid)

When I run the workflow I get a series of repeated, similar error messages. I let it run overnight just in case, and in the morning it was still producing the same message without ever finishing the calculations. The error message is below:

NA | error:   ℹ In index: 2.                                                                                       
                  Caused by error in `predict.xgb.Booster()`:
                  ! [07:19:55] src/gbm/gbtree.cc:549: Check failed: tree_end <= model_.trees.size() (223 vs. 7) : Invalid number of trees.
                  Stack trace:
                    [bt] (0) 1   xgboost.so                          0x000000013a10bd3c dmlc::LogMessageFatal::~LogMessageFatal() + 124
                    [bt] (1) 2   xgboost.so                          0x000000013a16f3b0 xgboost::gbm::GBTree::PredictBatch(xgboost::DMatrix*, xgboost::PredictionCacheEntry*, bool, unsigned int, unsigned int) + 496
                    [bt] (2) 3   xgboost.so                          0x000000013a271434 xgboost::LearnerImpl::PredictRaw(xgboost::DMatrix*, xgboost::PredictionCacheEntry*, bool, unsigned int, unsigned int) const + 116
                    [bt] (3) 4   xgboost.so                          0x000000013a261fb4 xgboost::LearnerImpl::Predict(std::__1::shared_ptr<xgboost::DMatrix>, bool, xgboost::HostDeviceVector<float>*, unsigned int, unsigned int, bool, bool, bool, bool, bool) + 628
                    [bt] (4) 5   xgboost.so                          0x000000013a2ca9e0 XGBoosterPredictFromDMatrix + 800
                    [b

How can I resolve this?

Answer 1

Score: 2

I was able to reproduce this, and I agree it's a tricky one! In:

> Check failed: tree_end <= model_.trees.size() (223 vs. 7) : Invalid number of trees.

model_.trees.size() is the number of trees actually present in the fitted model, and tree_end is the last tree in 1:trees() that predict() tries to use on new observations. XGBoost is saying that it can't predict with trees from later iterations than were actually trained.

Commenting out the call to tune trees() resolves the errors. This isn't really an effective reduction in the size of your search space, because tuning stop_iter across resamples already results in varying numbers of trees. There is some debate as to whether trees() ought to be regarded as a tuning parameter at all. (A sketch that instead fixes trees at a large ceiling follows at the end of this answer.)
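
To make the check concrete, here is a minimal sketch against the xgboost package directly (mine, not part of the original answer): it trains a booster for only 7 rounds and then asks predict() for the first 223 iterations via iterationrange, mirroring the 223-vs-7 numbers in the error above. The data and numbers are purely illustrative, and this assumes an xgboost version whose predict.xgb.Booster() accepts iterationrange (1.4 or later).

library(xgboost)

# train a tiny regression booster for only 7 boosting rounds
dtrain <- xgb.DMatrix(as.matrix(mtcars[, -1]), label = mtcars$mpg)
bst <- xgb.train(params = list(objective = "reg:squarederror"),
                 data = dtrain, nrounds = 7)

# asking for predictions from iterations 1..223 when only 7 trees exist
# should trip the same `tree_end <= model_.trees.size()` check
predict(bst, dtrain, iterationrange = c(1, 224))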

I used the following reprex to reproduce the error:

library(tidymodels)

mtcars <- tibble(mtcars[rep(1:32, 10),])

xgb_v_fold <- 
  vfold_cv(data = mtcars,
           v = 5, 
           repeats = 5)

xgb_recipe <- 
  recipe(mtcars, formula = mpg ~ cyl + disp) %>%
  step_dummy(all_factor_predictors()) %>%
  step_impute_bag(all_predictors()) %>%
  step_zv()

xgb_mod <- 
  boost_tree(mode = 'regression',
             engine = 'xgboost',
             mtry = tune(),
             trees = tune(),
             min_n = tune(),
             tree_depth = tune(),
             learn_rate = tune(),
             loss_reduction = tune(),
             sample_size = tune(),
             stop_iter = tune()
  )

xgb_grid <- grid_regular(mtry(c(1, 5)),
                         trees(),
                         min_n(),
                         tree_depth(range = c(1, 5)),
                         learn_rate(),
                         loss_reduction(),
                         sample_size(range = c(1, 1)),
                         stop_iter(), 
                         levels = 10
)

xgb_results <- 
  workflow() %>%
  add_model(xgb_mod) %>%
  add_recipe(xgb_recipe) %>%
  tune_grid(resamples = xgb_v_fold,
            grid = xgb_grid)

and resolved it by writing:

xgb_mod <- 
  boost_tree(mode = 'regression',
             engine = 'xgboost',
             mtry = tune(),
             min_n = tune(),
             tree_depth = tune(),
             learn_rate = tune(),
             loss_reduction = tune(),
             sample_size = tune(),
             stop_iter = tune()
  )

xgb_grid <- grid_regular(mtry(c(1, 5)),
                         min_n(),
                         tree_depth(range = c(1, 5)),
                         learn_rate(),
                         loss_reduction(),
                         sample_size(range = c(1, 1)),
                         stop_iter(), 
                         levels = 10
)

xgb_results <- 
  workflow() %>%
  add_model(xgb_mod) %>%
  add_recipe(xgb_recipe) %>%
  tune_grid(resamples = xgb_v_fold,
            grid = xgb_grid)
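
If you want to keep early stopping in the picture, a common alternative (my sketch, not part of the original answer) is to fix trees at a generous ceiling instead of tuning it, and let the tuned stop_iter decide how many of those rounds are actually used. The ceiling of 1000 below is an arbitrary illustrative value:

xgb_mod <- 
  boost_tree(mode = 'regression',
             engine = 'xgboost',
             trees = 1000,          # fixed upper bound on boosting rounds
             mtry = tune(),
             min_n = tune(),
             tree_depth = tune(),
             learn_rate = tune(),
             loss_reduction = tune(),
             sample_size = tune(),
             stop_iter = tune()     # early stopping trims the effective count
  )

The grid and workflow then stay exactly as in the resolved version above, since trees is no longer a tuning parameter.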