June 1, 2023

Problem when scoring new data -- tidymodels

Question

I'm learning tidymodels. The following code runs nicely:

library(tidyverse)
library(tidymodels)
# Draw a random sample of 2000 rows to try the models
set.seed(1234)
diamonds <- diamonds %>%
  sample_n(2000)

diamonds_split <- initial_split(diamonds, prop = 0.80, strata = "price")
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)
folds <- rsample::vfold_cv(diamonds_train, v = 10, strata = "price")
metric <- metric_set(rmse, rsq, mae)
# KNN model
knn_spec <-
  nearest_neighbor(
    mode = "regression",
    neighbors = tune("k"),
    engine = "kknn"
  )
knn_rec <-
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
knn_wflow <-
  workflow() %>%
  add_model(knn_spec) %>%
  add_recipe(knn_rec)
knn_grid = expand.grid(k = c(1, 5, 10, 30))
knn_res <-
  tune_grid(
    knn_wflow,
    resamples = folds,
    metrics = metric,
    grid = knn_grid
  )
collect_metrics(knn_res)
autoplot(knn_res)
show_best(knn_res, metric = "rmse")
# Best KNN model
best_knn_spec <-
  nearest_neighbor(
    mode = "regression",
    neighbors = 10,
    engine = "kknn"
  )
best_knn_wflow <-
  workflow() %>%
  add_model(best_knn_spec) %>%
  add_recipe(knn_rec)
best_knn_fit <- last_fit(best_knn_wflow, diamonds_split)
collect_metrics(best_knn_fit)

But when I try to fit the best model on the training set and apply it to the test set, I run into problems. The following two lines give me this error: "Error in step_log():
! The following required column is missing from new_data in step 'log_mUSAb': price.
Run rlang::last_trace() to see where the error occurred."

# Predict manually
f1 = fit(best_knn_wflow,diamonds_train)
p1 = predict(f1,new_data=diamonds_test)


Answer 1

Score: 1

This problem is related to https://stackoverflow.com/questions/76158409/log-transform-outcome-variable-in-tidymodels-workflow/76158558#76158558

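For log transformations of the outcome, we strongly recommend applying the transformation before the data is passed to recipe(). You are not guaranteed to have an outcome when predicting on new data (predicting is also what happens when you last_fit() a workflow), and when the outcome is missing, the recipe fails.

You see the error here because predicting on a workflow() object passes only the predictors to the recipe, as that is all it needs. The price column is therefore absent when step_log() runs.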
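The failure can be reproduced with the recipe alone, without any model. The following is a minimal sketch, assuming the built-in mtcars data as a stand-in for diamonds; `new_cars` is a hypothetical name for new data that lacks the outcome. It also shows the `skip = TRUE` argument of step_log(), which is not part of this answer's recommendation: it makes bake() skip the step on new data, at the cost that the outcome is then log-transformed only during training.

```r
library(recipes)

# Recipe that log-transforms the outcome, as in the question
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_log(all_outcomes())
prepped <- prep(rec, training = mtcars)

# New data with predictors only (no outcome column), as at prediction time
new_cars <- mtcars[1:3, setdiff(names(mtcars), "mpg")]

# bake() fails because step_log() needs the missing outcome column;
# capture the error message instead of stopping the script
result <- tryCatch(
  bake(prepped, new_data = new_cars),
  error = function(e) conditionMessage(e)
)
print(result)

# Alternative (an assumption, not the answer's advice): skip the step on new data
rec_skip <- recipe(mpg ~ ., data = mtcars) %>%
  step_log(all_outcomes(), skip = TRUE)
prepped_skip <- prep(rec_skip, training = mtcars)
baked <- bake(prepped_skip, new_data = new_cars)
```

With `skip = TRUE`, any held-out outcome stays on its original scale, so metrics computed against log-scale predictions would be misleading; transforming the outcome up front avoids that mismatch.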

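Since a log transformation is not a learned transformation (it estimates nothing from the training data), you can safely apply it beforehand.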

diamonds_train$price <- log(diamonds_train$price)
if (!is.null(diamonds_test$price)) {
  diamonds_test$price <- log(diamonds_test$price)
}
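Put together with the question's code, the advice might look like the sketch below. It assumes the log transform is applied once before splitting (equivalent to transforming the training and test sets separately, since nothing is estimated), drops step_log() from the recipe, and uses the hypothetical names `diamonds_small`, `knn_fit`, `preds`, and `preds_new`. Predictions come back on the log scale, so exp() is used to return to the original price scale.

```r
library(tidymodels)

set.seed(1234)
# Log-transform the outcome before splitting, so the recipe never touches it
diamonds_small <- ggplot2::diamonds %>%
  slice_sample(n = 2000) %>%
  mutate(price = log(price))

diamonds_split <- initial_split(diamonds_small, prop = 0.80, strata = price)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)

# The recipe now contains predictor-only steps
knn_rec <-
  recipe(price ~ ., data = diamonds_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

best_knn_spec <-
  nearest_neighbor(mode = "regression", neighbors = 10, engine = "kknn")

best_knn_wflow <-
  workflow() %>%
  add_model(best_knn_spec) %>%
  add_recipe(knn_rec)

# Manual fit and predict, which failed in the question
knn_fit <- fit(best_knn_wflow, diamonds_train)
preds <- predict(knn_fit, new_data = diamonds_test) %>%
  mutate(.pred_price = exp(.pred)) # back-transform to the original scale

# Prediction also works when the outcome column is absent
preds_new <- predict(knn_fit, new_data = select(diamonds_test, -price))
```

Metrics from last_fit() or tune_grid() on this workflow are then reported on the log scale, which is the same scale the question's recipe produced.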

