Problem when scoring new data -- tidymodels
Question
I'm learning tidymodels. The following code runs nicely:
library(tidyverse)
library(tidymodels)

# Draw a random sample of 2000 to try the models
set.seed(1234)
diamonds <- diamonds %>%
  sample_n(2000)

diamonds_split <- initial_split(diamonds, prop = 0.80, strata = "price")
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)

folds <- rsample::vfold_cv(diamonds_train, v = 10, strata = "price")
metric <- metric_set(rmse, rsq, mae)

# Model KNN
knn_spec <-
  nearest_neighbor(
    mode = "regression",
    neighbors = tune("k"),
    engine = "kknn"
  )

knn_rec <-
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

knn_wflow <-
  workflow() %>%
  add_model(knn_spec) %>%
  add_recipe(knn_rec)

knn_grid <- expand.grid(k = c(1, 5, 10, 30))

knn_res <-
  tune_grid(
    knn_wflow,
    resamples = folds,
    metrics = metric,
    grid = knn_grid
  )

collect_metrics(knn_res)
autoplot(knn_res)
show_best(knn_res, metric = "rmse")

# Best KNN
best_knn_spec <-
  nearest_neighbor(
    mode = "regression",
    neighbors = 10,
    engine = "kknn"
  )

best_knn_wflow <-
  workflow() %>%
  add_model(best_knn_spec) %>%
  add_recipe(knn_rec)

best_knn_fit <- last_fit(best_knn_wflow, diamonds_split)
collect_metrics(best_knn_fit)
But when I try to fit the best model on the training set and apply it to the test set, I run into problems. The following two lines give me this error:
Error in step_log():
! The following required column is missing from new_data in step 'log_mUSAb': price.
Run rlang::last_trace() to see where the error occurred.
# Predict manually
f1 <- fit(best_knn_wflow, diamonds_train)
p1 <- predict(f1, new_data = diamonds_test)
Answer 1
Score: 1
This problem is related to https://stackoverflow.com/questions/76158409/log-transform-outcome-variable-in-tidymodels-workflow/76158558#76158558
For log transformations of the outcome, we strongly recommend that those transformations be done before you pass the data to recipe(). This is because you are not guaranteed to have an outcome when predicting on new data (which is what happens when you last_fit() a workflow), and the recipe then fails.
You are seeing this here because when you predict on a workflow() object, it only passes the predictors to the recipe, as that is all it needs. The step_log(all_outcomes()) step then cannot find price, hence the error.
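The failure can be reproduced outside a workflow. The following is a minimal sketch, assuming mtcars as a small stand-in dataset (with mpg playing the role of price); the object names are illustrative only. It bakes a prepped recipe once with the outcome present and once with predictors only.

```r
library(tidymodels)

# mtcars is a stand-in dataset here; mpg plays the role of price.
rec <-
  recipe(mpg ~ disp + hp, data = mtcars) %>%
  step_log(all_outcomes()) %>%
  prep()

# Baking works when the outcome column is present...
with_outcome <- bake(rec, new_data = mtcars[1:3, ])

# ...but fails when new_data holds only predictors, which is the
# situation that predict() on a workflow creates.
predictors_only <- mtcars[1:3, c("disp", "hp")]
result <- tryCatch(
  bake(rec, new_data = predictors_only),
  error = function(e) conditionMessage(e)
)

# result now holds the error message as a character string.
print(result)
```

The exact wording of the message may differ between recipes versions, but the second bake() call should fail because step_log() has no outcome column to transform.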
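Since a log transformation isn't a learned transformation, you can safely do it beforehand. Once you do, drop step_log(all_outcomes()) from the recipe; otherwise the outcome is logged a second time during fitting and the error persists at prediction time.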
diamonds_train$price <- log(diamonds_train$price)
if (!is.null(diamonds_test$price)) {
  diamonds_test$price <- log(diamonds_test$price)
}
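Putting the pieces together, the manual fit and predict from the question could look roughly like this. It is a sketch under a few assumptions rather than the definitive fix: the outcome is logged once before splitting (instead of separately on the training and test sets), the names diamonds_log and .pred_price are made up for illustration, slice_sample() stands in for sample_n(), and the kknn package is assumed to be installed.

```r
library(tidymodels)

set.seed(1234)

# Log-transform the outcome once, before splitting, so the training and
# test sets are on the same scale.
diamonds_log <- ggplot2::diamonds %>%
  dplyr::slice_sample(n = 2000) %>%
  dplyr::mutate(price = log(price))

diamonds_split <- initial_split(diamonds_log, prop = 0.80, strata = price)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)

# Same recipe as in the question, minus step_log(all_outcomes()).
knn_rec <-
  recipe(price ~ ., data = diamonds_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

best_knn_spec <-
  nearest_neighbor(mode = "regression", neighbors = 10, engine = "kknn")

best_knn_wflow <-
  workflow() %>%
  add_model(best_knn_spec) %>%
  add_recipe(knn_rec)

f1 <- fit(best_knn_wflow, diamonds_train)

# predict() no longer needs the outcome, so it also works on new data
# that has no price column at all.
p1 <- predict(f1, new_data = dplyr::select(diamonds_test, -price))

# Predictions are on the log scale; exp() returns them to original units.
p1 <- dplyr::mutate(p1, .pred_price = exp(.pred))
```

Metrics computed by last_fit() on this split are then on the log scale as well, which matches what the recipe-based version in the question reported.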