Problem when scoring new data -- tidymodels

Question

I'm learning tidymodels. The following code runs nicely:

library(tidyverse)
library(tidymodels)

# Draw a random sample of 2000 rows to try the models

set.seed(1234)

diamonds <- diamonds %>%    
  sample_n(2000)
  
diamonds_split <- initial_split(diamonds, prop = 0.80, strata="price")

diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)

folds <- rsample::vfold_cv(diamonds_train, v = 10, strata="price")

metric <- metric_set(rmse,rsq,mae)

# KNN model

knn_spec <-
  nearest_neighbor(
    mode = "regression", 
    neighbors = tune("k"),
    engine = "kknn"
  ) 

knn_rec <-
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_dummy(all_nominal_predictors())

knn_wflow <- 
  workflow() %>% 
  add_model(knn_spec) %>%
  add_recipe(knn_rec)

knn_grid = expand.grid(k=c(1,5,10,30))

knn_res <- 
  tune_grid(
    knn_wflow,
    resamples = folds,
    metrics = metric,
    grid = knn_grid
  )

collect_metrics(knn_res)
autoplot(knn_res)

show_best(knn_res,metric="rmse")

# Best KNN model

best_knn_spec <-
  nearest_neighbor(
    mode = "regression", 
    neighbors = 10,
    engine = "kknn"
  ) 

best_knn_wflow <- 
  workflow() %>% 
  add_model(best_knn_spec) %>%
  add_recipe(knn_rec)

best_knn_fit <- last_fit(best_knn_wflow, diamonds_split)

collect_metrics(best_knn_fit)

But when I fit the best model on the training set and apply it to the test set myself, I run into problems. The two lines below give me this error:

Error in step_log():
! The following required column is missing from new_data in step 'log_mUSAb': price.
Run rlang::last_trace() to see where the error occurred.

# Predict manually

f1 = fit(best_knn_wflow,diamonds_train)
p1 = predict(f1,new_data=diamonds_test)

Answer 1

Score: 1

This problem is related to https://stackoverflow.com/questions/76158409/log-transform-outcome-variable-in-tidymodels-workflow/76158558#76158558

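For log transformations of the outcome, we strongly recommend applying the transformation before you pass the data to recipe(). You are not guaranteed to have an outcome when predicting on new data, and a recipe step that needs the outcome then fails. (last_fit() worked in the question because the test split it evaluates on still contains price.)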

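You see the error here because predict() on a workflow() object passes only the predictors to the recipe, since that is all it needs. The price column never reaches step_log(), which is why the step reports it as missing.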
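The failure can be reproduced with recipes alone, outside a workflow. The sketch below is only an illustration: the names small and new_obs and the choice of two predictors are assumptions, and the exact error text may differ between recipes versions.

```r
library(recipes)

data(diamonds, package = "ggplot2")
set.seed(1234)
small <- diamonds[sample(nrow(diamonds), 200), ]

# A recipe whose only step transforms the outcome
rec <- recipe(price ~ carat + depth, data = small) %>%
  step_log(all_outcomes())

prepped <- prep(rec, training = small)

# New data without the outcome, similar to what predict() hands to the recipe
new_obs <- small[1:5, c("carat", "depth")]

# bake() fails because step_log() needs a price column that is not there;
# tryCatch() captures the message instead of stopping the script
result <- tryCatch(
  bake(prepped, new_data = new_obs),
  error = function(e) conditionMessage(e)
)
print(result)
```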

Since a log transformation isn't a learned transformation, you can safely apply it beforehand.
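A small base R check of what "not learned" means, using made-up prices: log() works element by element, so transforming before or after subsetting gives the same values, whereas centering and scaling depends on which rows are used to estimate the mean and standard deviation.

```r
price <- c(326, 554, 2757, 18823)  # illustrative values only
train_idx <- c(1, 3)               # hypothetical "training" rows

# Log is element-wise: the order of transforming and subsetting does not matter
log_then_split <- log(price)[train_idx]
split_then_log <- log(price[train_idx])
same_log <- isTRUE(all.equal(log_then_split, split_then_log))

# Scaling is learned: mean and sd change with the rows they are computed on
scale_then_split <- as.numeric(scale(price))[train_idx]
split_then_scale <- as.numeric(scale(price[train_idx]))
same_scale <- isTRUE(all.equal(scale_then_split, split_then_scale))

print(c(same_log = same_log, same_scale = same_scale))
```

This is why steps such as step_normalize() belong in the recipe, where they are estimated on the training data only, while the log of the outcome can be taken up front.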

diamonds_train$price <- log(diamonds_train$price)

if (!is.null(diamonds_test$price)) {
  diamonds_test$price <- log(diamonds_test$price)
}
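Put together with the code from the question, the advice could look like the sketch below. It is one possible arrangement, not the only one: it assumes the log is taken before splitting so that the resamples also see the transformed outcome, it drops step_log(all_outcomes()) from the recipe, and the names diamonds_log and pred_price are made up for illustration. It needs the kknn package installed.

```r
library(tidymodels)

set.seed(1234)

# Log-transform the outcome up front, before any splitting or resampling
diamonds_log <- ggplot2::diamonds %>%
  slice_sample(n = 2000) %>%
  mutate(price = log(price))

diamonds_split <- initial_split(diamonds_log, prop = 0.80, strata = price)
diamonds_train <- training(diamonds_split)
diamonds_test  <- testing(diamonds_split)

# Same recipe as in the question, minus step_log(all_outcomes())
knn_rec <-
  recipe(price ~ ., data = diamonds_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

best_knn_spec <-
  nearest_neighbor(mode = "regression", neighbors = 10, engine = "kknn")

best_knn_wflow <-
  workflow() %>%
  add_model(best_knn_spec) %>%
  add_recipe(knn_rec)

# Manual fit and predict now work, because no step needs the outcome
f1 <- fit(best_knn_wflow, diamonds_train)
p1 <- predict(f1, new_data = diamonds_test)

# Predictions are on the log scale; exp() maps them back to the price scale
p1_price <- mutate(p1, pred_price = exp(.pred))
```

Metrics from last_fit() or collect_metrics() on this workflow are then computed on the log scale as well, which matches what the original recipe produced.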

huangapple
  • Posted on June 1, 2023, 10:51:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76378383.html