'Invalid number of trees' error with XGBoost and tidymodels
Question
I am trying to predict what a patient's response to a treatment will be. I am an enthusiastic amateur when it comes to machine learning, but I can usually muddle my way through eventually. This one is beyond me.
I am using R 4.2.2 and `tidymodels` in RStudio 2023.06.0+421 on an M1 MacBook Air.
My dataset is, unfortunately, very small. I have used it to construct logistic regression, Gaussian Naive Bayes, and C5.0 decision tree models with varying degrees of success, but I figure XGBoost is worth at least trying. The data consists of 131 observations of various blood tests and ventilator settings. After pre-processing, there are 39 variables for each observation.
I have initially created a resampled object:
xgb_v_fold <-
  vfold_cv(data = prone_session_1,
           v = 5,
           repeats = 5,
           strata = mortality_28)
The recipe for data processing is:
xgb_recipe <-
  recipe(prone_session_1, formula = mortality_28 ~ .) %>%
  step_rm(patient_id,
          bmi,
          weight_kg,
          fi_o2_supine,
          pa_o2_supine) %>%
  step_dummy(all_factor_predictors(), -mortality_28) %>%
  step_impute_bag(all_predictors()) %>%
  step_zv()
I have set the model up for hyperparameter tuning:
xgb_mod <-
  boost_tree(mode = 'classification',
             engine = 'xgboost',
             mtry = tune(),
             trees = tune(),
             min_n = tune(),
             tree_depth = tune(),
             learn_rate = tune(),
             loss_reduction = tune(),
             sample_size = tune(),
             stop_iter = tune())
For the tuning grid I have used a mix of default settings in the `dials` package, and supplemented this with results from the `finalize()` command.
xgb_param_fin <- extract_parameter_set_dials(xgb_mod) %>%
  finalize(juice(xgb_recipe))

xgb_grid <- grid_regular(mtry(range = c(1, 39)),
                         trees(),
                         min_n(),
                         tree_depth(range = c(1, 5)),
                         learn_rate(),
                         loss_reduction(),
                         sample_size(range = c(1, 1)),
                         stop_iter(),
                         levels = 10)
Combining all these gives a workflow object:
xgb_results <-
  workflow() %>%
  add_model(xgb_mod) %>%
  add_recipe(xgb_recipe) %>%
  tune_grid(resamples = xgb_v_fold,
            grid = xgb_grid)
When I run the workflow I get a series of repeated and similar error messages. I ran it through the night last night just in case, and in the morning it was still spitting out the same message without ever concluding the calculations. This error message is below.
→ NA | error: ℹ In index: 2.
Caused by error in `predict.xgb.Booster()`:
! [07:19:55] src/gbm/gbtree.cc:549: Check failed: tree_end <= model_.trees.size() (223 vs. 7) : Invalid number of trees.
Stack trace:
[bt] (0) 1 xgboost.so 0x000000013a10bd3c dmlc::LogMessageFatal::~LogMessageFatal() + 124
[bt] (1) 2 xgboost.so 0x000000013a16f3b0 xgboost::gbm::GBTree::PredictBatch(xgboost::DMatrix*, xgboost::PredictionCacheEntry*, bool, unsigned int, unsigned int) + 496
[bt] (2) 3 xgboost.so 0x000000013a271434 xgboost::LearnerImpl::PredictRaw(xgboost::DMatrix*, xgboost::PredictionCacheEntry*, bool, unsigned int, unsigned int) const + 116
[bt] (3) 4 xgboost.so 0x000000013a261fb4 xgboost::LearnerImpl::Predict(std::__1::shared_ptr<xgboost::DMatrix>, bool, xgboost::HostDeviceVector<float>*, unsigned int, unsigned int, bool, bool, bool, bool, bool) + 628
[bt] (4) 5 xgboost.so 0x000000013a2ca9e0 XGBoosterPredictFromDMatrix + 800
[b
→
How can I resolve this?
Answer 1
Score: 2
I was able to reproduce, and agree this one was tricky! In:
> Check failed: tree_end <= model_.trees.size() (223 vs. 7) : Invalid number of trees.
`model_.trees.size()` refers to the value of `trees()`, with `tree_end` being the last tree in `1:trees()` used to predict on new observations. XGBoost is saying that it can't predict with trees from later iterations than were actually trained.
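For intuition, here is a minimal sketch (mine, not part of the original answer) that trips the same check directly in the xgboost R package: train a booster for only a few rounds, then ask `predict()` for more trees than were ever built. It assumes a reasonably recent xgboost release in which `predict.xgb.Booster()` accepts an `iterationrange` argument and errors rather than silently clamping the range.
library(xgboost)

# Toy regression data; any small numeric matrix will do
x      <- as.matrix(mtcars[, c("cyl", "disp")])
dtrain <- xgb.DMatrix(x, label = mtcars$mpg)

# Train only 5 boosting rounds, so the model contains 5 trees
bst <- xgb.train(params = list(objective = "reg:squarederror"),
                 data = dtrain,
                 nrounds = 5)

# Requesting predictions from trees 1..200 asks for more trees than exist,
# which should fail the same gbtree.cc check:
# "Check failed: tree_end <= model_.trees.size() ... Invalid number of trees."
predict(bst, dtrain, iterationrange = c(1, 200))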
Commenting out the calls to tune `trees()` resolves the errors. This isn't really an effective reduction in the size of your search space, as tuning over `stop_iter` across resamples will result in varying numbers of trees. There is some debate as to whether `trees()` ought to be regarded as a tuning parameter.
I used the following reprex to reproduce the error:
library(tidymodels)

mtcars <- tibble(mtcars[rep(1:32, 10), ])

xgb_v_fold <-
  vfold_cv(data = mtcars,
           v = 5,
           repeats = 5)

xgb_recipe <-
  recipe(mtcars, formula = mpg ~ cyl + disp) %>%
  step_dummy(all_factor_predictors()) %>%
  step_impute_bag(all_predictors()) %>%
  step_zv()

xgb_mod <-
  boost_tree(mode = 'regression',
             engine = 'xgboost',
             mtry = tune(),
             trees = tune(),
             min_n = tune(),
             tree_depth = tune(),
             learn_rate = tune(),
             loss_reduction = tune(),
             sample_size = tune(),
             stop_iter = tune())

xgb_grid <- grid_regular(mtry(c(1, 5)),
                         trees(),
                         min_n(),
                         tree_depth(range = c(1, 5)),
                         learn_rate(),
                         loss_reduction(),
                         sample_size(range = c(1, 1)),
                         stop_iter(),
                         levels = 10)

xgb_results <-
  workflow() %>%
  add_model(xgb_mod) %>%
  add_recipe(xgb_recipe) %>%
  tune_grid(resamples = xgb_v_fold,
            grid = xgb_grid)
and resolved by writing:
xgb_mod <-
  boost_tree(mode = 'regression',
             engine = 'xgboost',
             mtry = tune(),
             min_n = tune(),
             tree_depth = tune(),
             learn_rate = tune(),
             loss_reduction = tune(),
             sample_size = tune(),
             stop_iter = tune())

xgb_grid <- grid_regular(mtry(c(1, 5)),
                         min_n(),
                         tree_depth(range = c(1, 5)),
                         learn_rate(),
                         loss_reduction(),
                         sample_size(range = c(1, 1)),
                         stop_iter(),
                         levels = 10)

xgb_results <-
  workflow() %>%
  add_model(xgb_mod) %>%
  add_recipe(xgb_recipe) %>%
  tune_grid(resamples = xgb_v_fold,
            grid = xgb_grid)
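As a side note (my own sketch, not part of the answer above): if you want to keep early stopping in play, one common alternative is to drop `trees()` from tuning and instead fix it at a generously large ceiling, letting the tuned `stop_iter` decide how many trees each resample actually keeps. The value 1000 below is an arbitrary placeholder, not a recommendation from the answer.
xgb_mod_fixed_trees <-
  boost_tree(mode = 'regression',
             engine = 'xgboost',
             trees = 1000,            # fixed ceiling; early stopping trims it per resample
             mtry = tune(),
             min_n = tune(),
             tree_depth = tune(),
             learn_rate = tune(),
             loss_reduction = tune(),
             sample_size = tune(),
             stop_iter = tune())
The grid then stays as in the resolved code above, since trees no longer appears among the tuning parameters.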