March 3, 2023, 23:38:30

Using Yardstick to calculate RMSE for aggregate of predictions per group

Question

Sometimes I don't want to assess my models on their performance on predicting single observations, but rather I want to assess how a model performs for predictions in aggregate for groups. The group resampling tools in rsample, like group_vfold_cv, are great for ensuring all data splitting keeps groups together. But I want to assess models on group performance rather than performance for single observations.

For example, maybe I want to use a model that predicts individual housing prices, but I'm ultimately going to use the model to estimate how much a neighborhood is worth. Take the Ames dataset as an example: we can build models to predict a house's sale price. But instead of tuning the model based on its performance in predicting individual houses, I want to tune it on its performance in predicting the sum of housing prices for a neighborhood. (I'm imagining that the Ames dataset is "complete" for each neighborhood.)

I have provided sample code below. For speed, I kept the resampling and the grid minimal.

# Load the data and lump rare Neighborhood levels together
library(tidymodels)
df <- ames
df <- recipe(Sale_Price ~ ., data = df) %>%
  step_other(Neighborhood, threshold = .04) %>%
  prep() %>%
  bake(new_data = df)
# Split the data by neighborhood
set.seed(1)
df_splits <- group_initial_split(df, group = Neighborhood)
df_train <- training(df_splits)
df_test <- testing(df_splits)
set.seed(2)
df_folds <- group_vfold_cv(df_train, group = Neighborhood, v = 5, repeats = 1)
# Simple recipe for modeling Sale_Price
rec <- recipe(Sale_Price ~ Lot_Area + Year_Built + Gr_Liv_Area, data = df_train)
# Set up specifications for MARS and RF
mars_earth_spec <-
  mars(prod_degree = tune()) %>%
  set_engine('earth') %>%
  set_mode('regression')
rand_forest_ranger_spec <-
  rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine('ranger') %>%
  set_mode('regression')
# Set up the workflow set that pairs our recipe with the models
no_pre_proc <-
  workflow_set(
    preproc = list(simple = rec),
    models = list(MARS = mars_earth_spec, RF = rand_forest_ranger_spec)
  )
# Tune the models
grid_ctrl <-
  control_grid(
    save_pred = TRUE,
    parallel_over = "everything",
    save_workflow = TRUE
  )
grid_results <-
  no_pre_proc %>%
  workflow_map(
    seed = 1503,
    resamples = df_folds,
    grid = 5,
    control = grid_ctrl
  )
# Rank the models by RMSE, based on their performance estimating individual houses
grid_results %>%
  rank_results() %>%
  filter(.metric == "rmse") %>%
  select(model, .config, rmse = mean, rank)
# This is not what I want
# I want to rank the models by RMSE of aggregate predictions per neighborhood against the aggregate sale price
# Maybe I need something like... Truth = sum(Sale_Price, by = Neighborhood), estimate = sum(.pred, by Neighborhood)

I can assess a model's RMSE for individual houses, but I want to assess its RMSE for neighborhood worth.

Answer 1

Score: 1

There isn't built-in support for that goal, but you should be able to do it manually.

Since we have save_pred = TRUE in control_grid(), we can get all of those predictions using collect_predictions() with summarize = FALSE.

Then a series of {dplyr} functions, together with rmse(), which can be applied to grouped data frames, should give you what you want.
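As a rough way to state the quantity this approach computes (the symbols below are notation introduced here for illustration, not taken from the original post): let $G_k$ be the set of neighborhoods held out in resample $k$, $y_i$ the sale price of house $i$, and $\hat{y}_i$ its prediction. Because group_vfold_cv keeps each neighborhood together, every neighborhood falls wholly inside one assessment set.

```latex
\mathrm{RMSE}_k
  = \sqrt{\frac{1}{|G_k|} \sum_{g \in G_k}
      \Bigl( \sum_{i \in g} y_i - \sum_{i \in g} \hat{y}_i \Bigr)^{2}},
\qquad
\overline{\mathrm{RMSE}} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{RMSE}_k
```

The inner sums correspond to summing Sale_Price and .pred per neighborhood, and the outer average corresponds to taking the mean over the K resamples for each model configuration.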

# Load the data and lump rare Neighborhood levels together
library(tidymodels)
df <- ames
df <- recipe(Sale_Price ~ ., data = df) %>%
  step_other(Neighborhood, threshold = .04) %>%
  prep() %>%
  bake(new_data = df)
# Split the data by neighborhood
set.seed(1)
df_splits <- group_initial_split(df, group = Neighborhood)
df_train <- training(df_splits)
df_test <- testing(df_splits)
set.seed(2)
df_folds <- group_vfold_cv(df_train, group = Neighborhood, v = 5, repeats = 1)
# Simple recipe for modeling Sale_Price
rec <- recipe(Sale_Price ~ Lot_Area + Year_Built + Gr_Liv_Area, data = df_train)
# Set up specifications for MARS and RF
mars_earth_spec <-
  mars(prod_degree = tune()) %>%
  set_engine('earth') %>%
  set_mode('regression')
rand_forest_ranger_spec <-
  rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine('ranger') %>%
  set_mode('regression')
# Set up the workflow set that pairs our recipe with the models
no_pre_proc <-
  workflow_set(
    preproc = list(simple = rec),
    models = list(MARS = mars_earth_spec, RF = rand_forest_ranger_spec)
  )
# Tune the models
grid_ctrl <-
  control_grid(
    save_pred = TRUE,
    parallel_over = "everything",
    save_workflow = TRUE
  )
grid_results <-
  no_pre_proc %>%
  workflow_map(
    seed = 1503,
    resamples = df_folds,
    grid = 5,
    control = grid_ctrl
  )
#> i Creating pre-processing data to finalize unknown parameter: mtry
# Sum truth and predictions per neighborhood within each resample,
# compute RMSE per resample, then average over resamples
grid_results %>%
  collect_predictions(summarize = FALSE) %>%
  mutate(Neighborhood = df_train$Neighborhood[.row]) %>%
  group_by(id, model, .config, Neighborhood) %>%
  summarise(Sale_Price = sum(Sale_Price), .pred = sum(.pred), .groups = "drop") %>%
  group_by(id, model, .config) %>%
  rmse(truth = Sale_Price, estimate = .pred) %>%
  group_by(model, .config) %>%
  summarize(mean_rmse = mean(.estimate), .groups = "drop") %>%
  arrange(mean_rmse)
#> # A tibble: 7 × 3
#>   model       .config              mean_rmse
#>   <chr>       <chr>                    <dbl>
#> 1 rand_forest Preprocessor1_Model1  2667177.
#> 2 mars        Preprocessor1_Model2  2695526.
#> 3 rand_forest Preprocessor1_Model4  2819628.
#> 4 rand_forest Preprocessor1_Model5  2824109.
#> 5 rand_forest Preprocessor1_Model3  2845252.
#> 6 rand_forest Preprocessor1_Model2  3059321.
#> 7 mars        Preprocessor1_Model1  3563432.
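
To make the aggregation step easier to check by hand, here is a minimal sketch on a made-up data frame. The object `preds` and its values are hypothetical and only mimic the shape of what collect_predictions() returns; the column names `group` and `truth` are stand-ins for Neighborhood and Sale_Price.

```r
library(dplyr)
library(yardstick)

# Hypothetical toy predictions: two resamples, three groups (not real Ames data)
preds <- tibble(
  id    = c("Fold1", "Fold1", "Fold1", "Fold1", "Fold2", "Fold2"),
  group = c("A", "A", "B", "B", "C", "C"),
  truth = c(100, 200, 150, 250, 300, 100),
  .pred = c(110, 180, 160, 260, 290, 120)
)

group_rmse <- preds %>%
  # One row per group within each resample: summed truth and summed prediction
  group_by(id, group) %>%
  summarise(truth = sum(truth), .pred = sum(.pred), .groups = "drop") %>%
  # rmse() respects grouping, so this yields one RMSE per resample
  group_by(id) %>%
  rmse(truth = truth, estimate = .pred)

# Average the per-resample RMSE values
mean_group_rmse <- mean(group_rmse$.estimate)
```

In Fold1 the group totals are off by -10 (A) and 20 (B), so the group-level RMSE is sqrt(250), about 15.81; in Fold2 the single group is off by 10, so the RMSE is 10. With real tuning results, the same pattern would also be grouped by model and .config, as in the answer's code.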
