Problem when scoring new data -- tidymodels
Question
I'm learning tidymodels. The following code runs nicely:
library(tidyverse)
library(tidymodels)

# Draw a random sample of 2000 to try the models
set.seed(1234)
diamonds <- diamonds %>%
  sample_n(2000)

diamonds_split <- initial_split(diamonds, prop = 0.80, strata = "price")
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)

folds <- rsample::vfold_cv(diamonds_train, v = 10, strata = "price")
metric <- metric_set(rmse, rsq, mae)

# Model KNN
knn_spec <-
  nearest_neighbor(
    mode = "regression",
    neighbors = tune("k"),
    engine = "kknn"
  )

knn_rec <-
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

knn_wflow <-
  workflow() %>%
  add_model(knn_spec) %>%
  add_recipe(knn_rec)

knn_grid = expand.grid(k = c(1, 5, 10, 30))

knn_res <-
  tune_grid(
    knn_wflow,
    resamples = folds,
    metrics = metric,
    grid = knn_grid
  )

collect_metrics(knn_res)
autoplot(knn_res)
show_best(knn_res, metric = "rmse")

# Best KNN
best_knn_spec <-
  nearest_neighbor(
    mode = "regression",
    neighbors = 10,
    engine = "kknn"
  )

best_knn_wflow <-
  workflow() %>%
  add_model(best_knn_spec) %>%
  add_recipe(knn_rec)

best_knn_fit <- last_fit(best_knn_wflow, diamonds_split)
collect_metrics(best_knn_fit)
But when I try to fit the best model on the training set and apply it to the test set, I run into problems. The following two lines give me this error:
Error in step_log():
! The following required column is missing from new_data in step 'log_mUSAb': price.
Run rlang::last_trace() to see where the error occurred.
# Predict Manually
f1 = fit(best_knn_wflow,diamonds_train)
p1 = predict(f1,new_data=diamonds_test)
Answer 1
Score: 1
This problem is related to https://stackoverflow.com/questions/76158409/log-transform-outcome-variable-in-tidymodels-workflow/76158558#76158558
For log transformations of the outcome, we strongly recommend doing the transformation before you pass the data to recipe(). This is because you are not guaranteed to have an outcome column when predicting on new data (which is what happens when you last_fit() a workflow), and in that case the recipe fails.
You are seeing this here because when you predict on a workflow() object, it only passes the predictors, since that is all it needs. Hence the error.
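The failure can be reproduced without a model at all. The following is a minimal sketch, assuming the recipes and ggplot2 packages are installed; the names train, new_rows and result are hypothetical and not part of the original question. It preps a recipe that logs the outcome and then bakes it on rows that have no price column, which is roughly the situation the recipe is in when predict() hands it only the predictors.

```r
library(recipes)

# Use the same data set as the question; row ranges are arbitrary.
diamonds <- ggplot2::diamonds
train <- diamonds[1:500, ]

# New rows WITHOUT the outcome column, as at prediction time.
new_rows <- diamonds[501:510, setdiff(names(diamonds), "price")]

# A recipe that transforms the outcome inside the recipe.
rec <- recipe(price ~ carat + depth, data = train) %>%
  step_log(all_outcomes()) %>%
  prep()

# Baking on data that lacks `price` fails; capture the error instead of stopping.
result <- tryCatch(
  bake(rec, new_data = new_rows),
  error = function(e) e
)

# Print the captured condition's message.
if (inherits(result, "error")) message(conditionMessage(result))
```

Baking the same prepped recipe on rows that still contain price works, which is consistent with last_fit() succeeding in the question while predict() on new data does not.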
Since a log transformation isn't a learned transformation, you can safely apply it beforehand:
diamonds_train$price <- log(diamonds_train$price)
if (!is.null(diamonds_test$price)) {
  diamonds_test$price <- log(diamonds_test$price)
}
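Putting the advice together, the manual fit and predict from the question might look like the sketch below. It assumes tidymodels and kknn are installed and keeps neighbors = 10 from the question. The names diamonds_small and .pred_price are hypothetical, and the exp() back-transformation at the end is an addition for illustration, not part of the original answer.

```r
library(tidymodels)

set.seed(1234)

# Log-transform the outcome up front, before splitting and before the recipe.
diamonds_small <- ggplot2::diamonds %>%
  slice_sample(n = 2000) %>%
  mutate(price = log(price))

diamonds_split <- initial_split(diamonds_small, prop = 0.80, strata = price)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)

# The recipe now touches predictors only, so it no longer needs `price` at predict time.
knn_rec <-
  recipe(price ~ ., data = diamonds_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

best_knn_spec <-
  nearest_neighbor(
    mode = "regression",
    neighbors = 10,
    engine = "kknn"
  )

best_knn_wflow <-
  workflow() %>%
  add_model(best_knn_spec) %>%
  add_recipe(knn_rec)

# Fit on the training set and predict on the test set manually.
f1 <- fit(best_knn_wflow, diamonds_train)
p1 <- predict(f1, new_data = diamonds_test)

# Predictions are on the log scale; exp() maps them back to the price scale.
p1 <- p1 %>% mutate(.pred_price = exp(.pred))
```

Note that any metrics computed from .pred are on the log scale; to report errors in the original units, compare .pred_price with exp(diamonds_test$price).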