Using Yardstick to calculate RMSE for aggregate of predictions per group

Question

Sometimes I don't want to assess my models on their performance on predicting single observations, but rather I want to assess how a model performs for predictions in aggregate for groups. The group resampling tools in rsample, like group_vfold_cv, are great for ensuring all data splitting keeps groups together. But I want to assess models on group performance rather than performance for single observations.

For example, maybe I want to use a model that predicts individual housing prices, but I'm ultimately going to use the model to estimate how much a neighborhood is worth.
Take the Ames dataset as an example: we can build models to predict a house's sale price. But instead of tuning the model based on its performance for predicting individual houses, I want to tune the model on its performance in predicting the sum of housing prices for a neighborhood. (I'm imagining that the Ames dataset is "complete" for each neighborhood.)

I have provided sample code below. For speed, I kept the resampling and grid minimal.

# Load in data and transform the Neighborhood variable a little
library(tidymodels)
df <- ames
df <- recipe(Sale_Price ~ ., data = df) %>%
  step_other(Neighborhood, threshold = .04) %>%
  prep() %>%
  bake(new_data = df)

# Split data based on neighborhoods
set.seed(1)
df_splits <- group_initial_split(df, group = Neighborhood)
df_train <- training(df_splits)
df_test <- testing(df_splits)
set.seed(2)
df_folds <- group_vfold_cv(df_train, group = Neighborhood, v = 5, repeats = 1)

# Simple recipe for modeling Sale_Price
rec <- recipe(Sale_Price ~ Lot_Area + Year_Built + Gr_Liv_Area, data = df_train)

# Set up specifications for MARS and RF
mars_earth_spec <-
  mars(prod_degree = tune()) %>%
  set_engine('earth') %>%
  set_mode('regression')
rand_forest_ranger_spec <-
  rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine('ranger') %>%
  set_mode('regression')

# Set up the workflow set that pairs our recipe with the models
no_pre_proc <-
  workflow_set(
    preproc = list(simple = rec),
    models = list(MARS = mars_earth_spec, RF = rand_forest_ranger_spec)
  )

# Tune the models
grid_ctrl <-
  control_grid(
    save_pred = TRUE,
    parallel_over = "everything",
    save_workflow = TRUE
  )
grid_results <-
  no_pre_proc %>%
  workflow_map(
    seed = 1503,
    resamples = df_folds,
    grid = 5,
    control = grid_ctrl
  )

# Rank the models by RMSE, based on their performance estimating individual houses
grid_results %>%
  rank_results() %>%
  filter(.metric == "rmse") %>%
  select(model, .config, rmse = mean, rank)
# This is not what I want
# I want to rank the models by the RMSE of aggregate predictions per neighborhood against the aggregate sale price
# Maybe I need something like... truth = sum(Sale_Price, by = Neighborhood), estimate = sum(.pred, by = Neighborhood)

I can assess a model's RMSE for individual houses, but I want to assess a model's RMSE for neighborhood worth.

Answer 1

Score: 1


There isn't built-in support for that goal, but you should be able to do it manually.
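To make the target explicit, here is one way to write the quantity being asked for. The notation is my own assumption, not something from yardstick: $G_k$ is the set of neighborhoods in the assessment set of fold $k$, $y_i$ is the observed sale price of house $i$, $\hat{y}_i$ is its prediction, and $K$ is the number of folds.

```latex
\mathrm{RMSE}_k
  = \sqrt{\frac{1}{|G_k|} \sum_{g \in G_k}
      \Bigl( \sum_{i \in g} y_i - \sum_{i \in g} \hat{y}_i \Bigr)^{2}},
\qquad
\overline{\mathrm{RMSE}} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{RMSE}_k
```

Because group_vfold_cv() keeps each neighborhood inside a single fold, every neighborhood total is computed from one assessment set only.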

Since we have save_pred = TRUE in control_grid(), we can get all of those predictions using collect_predictions() with summarize = FALSE.

Then a series of {dplyr} functions, together with rmse(), which can be applied to grouped data frames, should give you what you want.
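As a small illustration of that idea before running it on the real tuning results, here is a sketch on made-up numbers. The `preds` tibble and its values are hypothetical and only mimic the columns that collect_predictions() returns; they are not output from the models in the question.

```r
library(dplyr)
library(yardstick)

# Hypothetical predictions: two folds, two neighborhoods per fold
preds <- tibble(
  id           = rep(c("Fold1", "Fold2"), each = 4),
  Neighborhood = c("A", "A", "B", "B", "C", "C", "D", "D"),
  Sale_Price   = c(100, 200, 150, 250, 120, 180, 300, 100),
  .pred        = c(110, 210, 140, 240, 100, 170, 320, 120)
)

# RMSE per fold on individual houses
house_rmse <- preds %>%
  group_by(id) %>%
  rmse(truth = Sale_Price, estimate = .pred)

# RMSE per fold on neighborhood totals:
# sum truth and predictions within each neighborhood first,
# then let rmse() work on the data frame grouped by fold
hood_rmse <- preds %>%
  group_by(id, Neighborhood) %>%
  summarise(
    Sale_Price = sum(Sale_Price),
    .pred = sum(.pred),
    .groups = "drop"
  ) %>%
  group_by(id) %>%
  rmse(truth = Sale_Price, estimate = .pred)

house_rmse
hood_rmse
```

In this toy data, errors within a neighborhood point in the same direction, so they add up in the totals and the neighborhood-level RMSE comes out larger than the house-level RMSE. When errors within a neighborhood cancel, the opposite can happen, which is why the two rankings need not agree.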

# Load in data and transform the Neighborhood variable a little
library(tidymodels)
df <- ames
df <- recipe(Sale_Price ~ ., data = df) %>%
  step_other(Neighborhood, threshold = .04) %>%
  prep() %>%
  bake(new_data = df)

# Split data based on neighborhoods
set.seed(1)
df_splits <- group_initial_split(df, group = Neighborhood)
df_train <- training(df_splits)
df_test <- testing(df_splits)
set.seed(2)
df_folds <- group_vfold_cv(df_train, group = Neighborhood, v = 5, repeats = 1)

# Simple recipe for modeling Sale_Price
rec <- recipe(Sale_Price ~ Lot_Area + Year_Built + Gr_Liv_Area, data = df_train)

# Set up specifications for MARS and RF
mars_earth_spec <-
  mars(prod_degree = tune()) %>%
  set_engine('earth') %>%
  set_mode('regression')
rand_forest_ranger_spec <-
  rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine('ranger') %>%
  set_mode('regression')

# Set up the workflow set that pairs our recipe with the models
no_pre_proc <-
  workflow_set(
    preproc = list(simple = rec),
    models = list(MARS = mars_earth_spec, RF = rand_forest_ranger_spec)
  )

# Tune the models
grid_ctrl <-
  control_grid(
    save_pred = TRUE,
    parallel_over = "everything",
    save_workflow = TRUE
  )
grid_results <-
  no_pre_proc %>%
  workflow_map(
    seed = 1503,
    resamples = df_folds,
    grid = 5,
    control = grid_ctrl
  )
#> i Creating pre-processing data to finalize unknown parameter: mtry

# Sum truth and predictions per neighborhood, compute RMSE per fold,
# then average across folds for each model configuration
grid_results %>%
  collect_predictions(summarize = FALSE) %>%
  mutate(Neighborhood = df_train$Neighborhood[.row]) %>%
  group_by(id, model, .config, Neighborhood) %>%
  summarise(Sale_Price = sum(Sale_Price), .pred = sum(.pred), .groups = "drop") %>%
  group_by(id, model, .config) %>%
  rmse(truth = Sale_Price, estimate = .pred) %>%
  group_by(model, .config) %>%
  summarize(mean_rmse = mean(.estimate), .groups = "drop") %>%
  arrange(mean_rmse)
#> # A tibble: 7 × 3
#>   model       .config              mean_rmse
#>   <chr>       <chr>                    <dbl>
#> 1 rand_forest Preprocessor1_Model1  2667177.
#> 2 mars        Preprocessor1_Model2  2695526.
#> 3 rand_forest Preprocessor1_Model4  2819628.
#> 4 rand_forest Preprocessor1_Model5  2824109.
#> 5 rand_forest Preprocessor1_Model3  2845252.
#> 6 rand_forest Preprocessor1_Model2  3059321.
#> 7 mars        Preprocessor1_Model1  3563432.
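If you need the same aggregation in more than one place, the summing step can be wrapped in a small function. This is only a sketch: `group_total_rmse()` is a hypothetical helper of my own, not part of yardstick or tune, and the toy data below is made up.

```r
library(dplyr)
library(yardstick)

# Hypothetical helper (not a yardstick function): sums truth and estimate
# within each level of `group`, then computes RMSE over the group totals.
# Any grouping already present on `data` (e.g. fold or model) is kept.
group_total_rmse <- function(data, group, truth, estimate) {
  data %>%
    group_by({{ group }}, .add = TRUE) %>%
    summarise(
      truth_total = sum({{ truth }}),
      pred_total  = sum({{ estimate }}),
      .groups = "drop_last"  # drop only `group`, keep outer groups
    ) %>%
    rmse(truth = truth_total, estimate = pred_total)
}

# Made-up example data
toy <- tibble(
  Neighborhood = c("A", "A", "B", "B"),
  Sale_Price   = c(100, 200, 150, 250),
  .pred        = c(110, 210, 140, 240)
)

res <- group_total_rmse(toy, Neighborhood, Sale_Price, .pred)
res
```

Because existing groups are preserved, the same helper could in principle be called on the collected predictions after `group_by(id, model, .config)`, or on a data frame of held-out predictions with a Neighborhood column attached. I have not run it against the tuning results above.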

huangapple
  • Posted on March 3, 2023, 23:38:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75629097.html