How can I apply preprocessing in each cross-validation fold trained on each train part of the fold using tidymodels?
Question
I am trying to use the tidymodels R package for an ML pipeline. I can define a preprocessing pipeline (a recipe) on the training data and apply it to each resample of my cross-validation, but this uses the (global) training data to preprocess the folds. What seems correct to me instead is to fit a preprocessing recipe on each "analysis" (i.e., training) part of a fold and apply it to the "assessment" (i.e., testing) part of that fold.
The following code gives an example of my problem:
library(tidyverse)
library(tidymodels)

set.seed(1000)

mtcars <- mtcars %>%
  select(mpg, hp)

init_split <- initial_split(mtcars, prop = 0.9)

preprocessing_recipe <- recipe(mpg ~ hp,
  data = training(init_split)
) %>%
  step_normalize(all_predictors())

preprocessing_recipe <- preprocessing_recipe %>% prep()
preprocessing_recipe

cv_folds <- bake(preprocessing_recipe, new_data = training(init_split)) %>%
  vfold_cv(v = 3)
## these resamples are not properly scaled:
training(cv_folds$splits[[1]]) %>% lapply(mean)
## $hp
## [1] 0.1442218
training(cv_folds$splits[[1]]) %>% lapply(sd)
## $hp
## [1] 1.167365
## while the preprocessing on the training data leads to exactly scaled data:
preprocessing_recipe$template %>% lapply(mean)
## $hp
## [1] -1.249001e-16
preprocessing_recipe$template %>% lapply(sd)
## $hp
## [1] 1
The reason why the above fails is clear. But how can I change the above pipeline (efficiently, elegantly) to fit a recipe on each training part of a fold and apply it to the corresponding test part? In my view this is the way to avoid data leakage. I haven't found any hints in the documentation or in any posts. Thanks!
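For reference, the per-fold behavior I'm after can be written out by hand with rsample's analysis() and assessment() accessors: prep a recipe on the analysis part of each split and bake both parts with those fold-specific statistics. A minimal sketch (the variable names are my own):

```r
library(tidymodels)

set.seed(1000)
dat <- mtcars[, c("mpg", "hp")]
folds <- vfold_cv(dat, v = 3)

## prep the recipe on the analysis part of each split only,
## then bake both parts with the fold-specific mean and sd
scaled_folds <- lapply(folds$splits, function(split) {
  fold_recipe <- recipe(mpg ~ hp, data = analysis(split)) %>%
    step_normalize(all_predictors()) %>%
    prep()
  list(
    analysis   = bake(fold_recipe, new_data = analysis(split)),
    assessment = bake(fold_recipe, new_data = assessment(split))
  )
})

## each analysis part is now exactly scaled (mean 0, sd 1 for hp)
sapply(scaled_folds, function(f) {
  c(mean = mean(f$analysis$hp), sd = sd(f$analysis$hp))
})
```

This works, but it re-implements by hand what the resampling functions shown in the answer below do automatically.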
Answer 1
Score: 1
When you are using a recipe as part of a full pipeline, you are unlikely to want to prep() or bake() it yourself outside of diagnostic purposes. What we recommend is to attach the recipe to a model specification with a workflow(). Here I'm adding a linear regression specification. These two together can be fit() and predict()-ed on, but you can also fit them inside your cross-validation loop with fit_resamples() or tune_grid(), depending on your needs.
For more information see:
library(tidyverse)
library(tidymodels)

set.seed(1000)

mtcars <- mtcars |>
  select(mpg, hp)

init_split <- initial_split(mtcars, prop = 0.9)
mtcars_training <- training(init_split)
mtcars_folds <- vfold_cv(mtcars_training, v = 3)

preprocessing_recipe <- recipe(mpg ~ hp,
  data = mtcars_training
) |>
  step_normalize(all_predictors())

lm_spec <- linear_reg()

wf_spec <- workflow() |>
  add_recipe(preprocessing_recipe) |>
  add_model(lm_spec)

resampled_fits <- fit_resamples(
  wf_spec,
  resamples = mtcars_folds,
  control = control_resamples(extract = function(x) {
    tidy(x, "recipe", number = 1)
  })
)
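As mentioned above, the same workflow can also be fit() and predict()-ed on directly, outside of resampling: fit() preps the recipe on the data it is given, and predict() bakes new data with those trained statistics before predicting. A self-contained sketch of that usage:

```r
library(tidymodels)

set.seed(1000)
init_split <- initial_split(mtcars[, c("mpg", "hp")], prop = 0.9)
mtcars_training <- training(init_split)

wf_spec <- workflow() |>
  add_recipe(
    recipe(mpg ~ hp, data = mtcars_training) |>
      step_normalize(all_predictors())
  ) |>
  add_model(linear_reg())

## fit() preps the recipe and fits the model in one step
wf_fit <- fit(wf_spec, data = mtcars_training)

## predict() applies the trained recipe to the new data first
predict(wf_fit, new_data = testing(init_split))
```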
We can see that the workflow is fit inside each fold by looking at the estimates of the recipe. I added a function to the extract argument of control_resamples() that pulls out the trained mean and sd calculated in the recipe.
resampled_fits |>
  collect_extracts() |>
  pull(.extracts)
#> [[1]]
#> # A tibble: 2 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 hp mean 140. normalize_x5pUR
#> 2 hp sd 77.3 normalize_x5pUR
#>
#> [[2]]
#> # A tibble: 2 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 hp mean 144. normalize_x5pUR
#> 2 hp sd 57.4 normalize_x5pUR
#>
#> [[3]]
#> # A tibble: 2 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 hp mean 150. normalize_x5pUR
#> 2 hp sd 74.9 normalize_x5pUR
And we can see that they match the mean and sd from the original folds.
mtcars_folds$splits |>
  map(analysis) |>
  map(~ tibble(mean = mean(.x$hp), sd = sd(.x$hp)))
#> [[1]]
#> # A tibble: 1 × 2
#> mean sd
#> <dbl> <dbl>
#> 1 140. 77.3
#>
#> [[2]]
#> # A tibble: 1 × 2
#> mean sd
#> <dbl> <dbl>
#> 1 144. 57.4
#>
#> [[3]]
#> # A tibble: 1 × 2
#> mean sd
#> <dbl> <dbl>
#> 1 150. 74.9
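Performance across the folds (not shown above) can then be summarised with collect_metrics() on the fit_resamples() result; for a regression workflow it reports rmse and rsq by default. A self-contained sketch:

```r
library(tidymodels)

set.seed(1000)
init_split <- initial_split(mtcars[, c("mpg", "hp")], prop = 0.9)
mtcars_folds <- vfold_cv(training(init_split), v = 3)

wf_spec <- workflow() |>
  add_recipe(
    recipe(mpg ~ hp, data = training(init_split)) |>
      step_normalize(all_predictors())
  ) |>
  add_model(linear_reg())

resampled_fits <- fit_resamples(wf_spec, resamples = mtcars_folds)

## metrics averaged across the folds
collect_metrics(resampled_fits)

## per-fold metrics
collect_metrics(resampled_fits, summarize = FALSE)
```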