Understanding why tune::last_fit metrics are different from summary()
Question
Context: I am trying to evaluate a model built with tune::last_fit() on an independent dataset.
Problem: the metrics obtained with tune::collect_metrics() appear to differ from the ones obtained using summary().
Question: what is the difference between the metric (here the R²) calculated by tune::collect_metrics() and the one calculated by summary()? Which one corresponds to the R² between the observations in the independent dataset and the predictions for those observations?
Reproducible example: using the example from https://tune.tidymodels.org/reference/last_fit.html as a starting point.
library(recipes)
library(rsample)
library(parsnip)

set.seed(6735)

# Train/test split of mtcars
tr_te_split <- initial_split(mtcars)

# Recipe with a natural spline on disp
spline_rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_ns(disp)

# Linear regression fitted with lm
lin_mod <- linear_reg() %>%
  set_engine("lm")

# Fit on the training data, evaluate on the holdout data
spline_res <- tune::last_fit(lin_mod, spline_rec, split = tr_te_split)
spline_res
#> # Resampling results
#> # Manual resampling
#> # A tibble: 1 × 6
#> splits id .metrics .notes .predictions .workflow
#> <list> <chr> <list> <list> <list> <list>
#> 1 <split [24/8]> train/test split <tibble> <tibble> <tibble [8 × 4]> <workflow>
# Here are the performance metrics for the model
tune::collect_metrics(spline_res)
#> # A tibble: 2 × 4
#> .metric .estimator .estimate .config
#> <chr> <chr> <dbl> <chr>
#> 1 rmse standard 3.80 Preprocessor1_Model1
#> 2 rsq standard 0.729 Preprocessor1_Model1
spline_res %>%
  parsnip::extract_fit_engine() %>% # back to the underlying stats lm object
  summary()
#>
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.4453 -1.1980 -0.1464 1.3246 2.8223
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 23.087028 18.641785 1.238 0.239
#> cyl 0.326218 1.402236 0.233 0.820
#> hp 0.005969 0.024848 0.240 0.814
#> drat -0.009576 1.597293 -0.006 0.995
#> wt -0.902839 2.503336 -0.361 0.725
#> qsec 0.185826 0.745021 0.249 0.807
#> vs 1.492756 2.255781 0.662 0.521
#> am 4.101555 3.110797 1.318 0.212
#> gear 0.174875 1.730223 0.101 0.921
#> carb -1.278962 1.009824 -1.267 0.229
#> disp_ns_1 -15.149506 13.649995 -1.110 0.289
#> disp_ns_2 -4.905087 6.756046 -0.726 0.482
#>
#> Residual standard error: 2.397 on 12 degrees of freedom
#> Multiple R-squared: 0.9204, Adjusted R-squared: 0.8473
#> F-statistic: 12.61 on 11 and 12 DF, p-value: 5.869e-05
Created on 2023-05-22 with reprex v2.0.2
As you can see, the two R² values are not equal.
Answer 1
Score: 1
The statistics that you get via last_fit() come from the holdout data. The ones from summary.lm() do not; they come from the same data that was used to fit the model.
Re-using data to assess model performance is a major pitfall when modeling. It will give you optimistic results (perhaps overwhelmingly optimistic, depending on the model).
There are tons of references on this. We give a small example in the tidymodels book.
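To make the distinction concrete, here is a minimal sketch (my addition, not part of the original answer) that recomputes both R² values from the reprex above. It assumes the spline_res and tr_te_split objects from that reprex are still in the session.
library(tune)
library(rsample)
library(yardstick)
library(dplyr)

# 1) The R² reported by collect_metrics(): computed on the holdout (test) rows
#    that last_fit() predicted. Recomputing it from the stored predictions
#    should reproduce the 0.729 shown above.
spline_res %>%
  collect_predictions() %>%            # test-set rows with mpg and .pred
  rsq(truth = mpg, estimate = .pred)

# 2) The R² reported by summary.lm(): computed on the training data the model
#    was fit on, so it should essentially reproduce the 0.9204 shown above.
fitted_wf   <- extract_workflow(spline_res)          # fitted workflow from last_fit()
train_data  <- training(tr_te_split)                 # the 24 training rows
train_preds <- predict(fitted_wf, new_data = train_data)
bind_cols(train_data, train_preds) %>%
  rsq(truth = mpg, estimate = .pred)
In other words, collect_metrics() is the one that compares observations and predictions on data the model never saw, which is what the question is asking for.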
Also, while this is not the issue here, tidymodels (and caret before it) uses a different estimator for R² than the canonical one used by linear regression (see ?yardstick::rsq). It performs better when model performance is close to zero.
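The difference between the two estimators can be seen directly in yardstick (my addition, not from the original answer): rsq() is the squared correlation between observed and predicted values, while rsq_trad() is the traditional 1 - SS_res / SS_tot definition that summary.lm() uses. A small sketch with made-up numbers:
library(yardstick)
library(tibble)

# Hypothetical observed/predicted pairs, only to illustrate the two estimators
d <- tibble(
  truth    = c(1, 2, 3, 4, 5),
  estimate = c(1.4, 1.9, 3.3, 3.6, 5.2)
)

rsq(d, truth, estimate)       # squared correlation between truth and estimate
rsq_trad(d, truth, estimate)  # traditional 1 - SS_res / SS_tot
The correlation-based rsq() always lies in [0, 1], whereas rsq_trad() can go negative when predictions are very poor, which is the kind of situation the answer alludes to.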