Understanding why tune::last_fit metrics are different from summary()

Question

Summary: tune::collect_metrics() reports an R² of 0.729, while summary() reports an R² of 0.9204. The two values differ because summary() describes how well the model fits the data it was trained on, whereas tune::collect_metrics() on a last_fit() result describes performance on the held-out test set. The R² for an independent dataset is obtained by predicting those observations with the model and computing the R² between the observed and predicted values.

Context: I am trying to evaluate a model, fitted with tune::last_fit(), on an independent dataset.

Problem: it seems that the metrics obtained with tune::collect_metrics() are different from the ones obtained using summary().

Question: what is the difference between the R² reported by tune::collect_metrics() and the one reported by summary()? Which one corresponds to the R² between the observations in the independent dataset and the predictions for those observations?

Reproducible example: using the example from https://tune.tidymodels.org/reference/last_fit.html as a starting point.

```r
library(recipes)
library(rsample)
library(parsnip)

set.seed(6735)
tr_te_split <- initial_split(mtcars)

spline_rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_ns(disp)

lin_mod <- linear_reg() %>%
  set_engine("lm")

spline_res <- tune::last_fit(lin_mod, spline_rec, split = tr_te_split)
spline_res
#> # Resampling results
#> # Manual resampling
#> # A tibble: 1 × 6
#>   splits         id               .metrics .notes   .predictions     .workflow
#>   <list>         <chr>            <list>   <list>   <list>           <list>
#> 1 <split [24/8]> train/test split <tibble> <tibble> <tibble [8 × 4]> <workflow>

# Here are the performance metrics for the model
tune::collect_metrics(spline_res)
#> # A tibble: 2 × 4
#>   .metric .estimator .estimate .config
#>   <chr>   <chr>          <dbl> <chr>
#> 1 rmse    standard       3.80  Preprocessor1_Model1
#> 2 rsq     standard       0.729 Preprocessor1_Model1

spline_res %>%
  parsnip::extract_fit_engine() %>% # back to stats lm object
  summary()
#>
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -3.4453 -1.1980 -0.1464  1.3246  2.8223
#>
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  23.087028  18.641785   1.238    0.239
#> cyl           0.326218   1.402236   0.233    0.820
#> hp            0.005969   0.024848   0.240    0.814
#> drat         -0.009576   1.597293  -0.006    0.995
#> wt           -0.902839   2.503336  -0.361    0.725
#> qsec          0.185826   0.745021   0.249    0.807
#> vs            1.492756   2.255781   0.662    0.521
#> am            4.101555   3.110797   1.318    0.212
#> gear          0.174875   1.730223   0.101    0.921
#> carb         -1.278962   1.009824  -1.267    0.229
#> disp_ns_1   -15.149506  13.649995  -1.110    0.289
#> disp_ns_2    -4.905087   6.756046  -0.726    0.482
#>
#> Residual standard error: 2.397 on 12 degrees of freedom
#> Multiple R-squared:  0.9204, Adjusted R-squared:  0.8473
#> F-statistic: 12.61 on 11 and 12 DF,  p-value: 5.869e-05
```

<sup>Created on 2023-05-22 with reprex v2.0.2</sup>

As you can see, the two R² values are not equal.

Answer 1

Score: 1

The statistics that you get via last_fit() are from holdout data. The ones from summary.lm() are not; they are from the same data being used to fit the model.

The re-use of data to assess model performance is a major pitfall when modeling. It will give you optimistic results (perhaps overwhelmingly optimistic, depending on the model).
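To make the distinction concrete, here is a minimal sketch that recomputes R² by hand on both partitions of the split from the question. It assumes the same packages as the question plus yardstick, and that tune::extract_workflow() returns the workflow fitted on the training set; the names fitted_wf, rsq_train and rsq_test are purely illustrative.

```r
library(recipes)
library(rsample)
library(parsnip)
library(yardstick)

# Same setup as in the question
set.seed(6735)
tr_te_split <- initial_split(mtcars)
spline_rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_ns(disp)
lin_mod <- linear_reg() %>%
  set_engine("lm")
spline_res <- tune::last_fit(lin_mod, spline_rec, split = tr_te_split)

# Workflow (recipe + lm) that last_fit() fitted on the training set
fitted_wf <- tune::extract_workflow(spline_res)

train_dat <- training(tr_te_split)
test_dat  <- testing(tr_te_split)

# Predict the data the model has already seen, and the data it has not
train_pred <- predict(fitted_wf, new_data = train_dat)
test_pred  <- predict(fitted_wf, new_data = test_dat)

# R² on the training data: the optimistic, re-used-data estimate
rsq_train <- rsq_vec(truth = train_dat$mpg, estimate = train_pred$.pred)

# R² on the holdout data: what collect_metrics() reports
rsq_test <- rsq_vec(truth = test_dat$mpg, estimate = test_pred$.pred)

rsq_train
rsq_test
```

If the split reproduces as in the question, rsq_train should agree with the Multiple R-squared from summary() and rsq_test with the rsq row from collect_metrics(). The holdout value is the one that answers the question about an independent dataset.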

There are tons of references on this. We give a small example in the tidymodels book.

Also, while this is not the issue here, tidymodels (and caret before it) uses a different estimator for $R^2$ than the canonical one used by linear regression (see ?yardstick::rsq): the squared correlation between the observed and predicted values. It behaves better when a model's $R^2$ is close to zero.
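The two estimators can be compared directly. The sketch below uses small made-up vectors (truth, estimate and bad_estimate are illustrative values, not results from the question) and checks yardstick's rsq_vec() and rsq_trad_vec() against the formulas written out in base R.

```r
library(yardstick)

# Made-up observed values and two sets of predictions, for illustration only
truth        <- c(21.0, 22.8, 18.7, 14.3, 30.4, 15.0)
estimate     <- c(19.5, 25.1, 17.0, 16.2, 26.8, 18.4)
bad_estimate <- c(30.0, 12.0, 25.0, 28.0, 14.0, 22.0)

# Squared correlation between observed and predicted values
rsq_by_cor <- function(truth, estimate) {
  cor(truth, estimate)^2
}

# Traditional definition: 1 - SSE / SST
rsq_by_ss <- function(truth, estimate) {
  1 - sum((truth - estimate)^2) / sum((truth - mean(truth))^2)
}

# yardstick::rsq_vec() matches the correlation form,
# yardstick::rsq_trad_vec() matches the sums-of-squares form
c(yardstick = rsq_vec(truth, estimate),
  manual    = rsq_by_cor(truth, estimate))
c(yardstick = rsq_trad_vec(truth, estimate),
  manual    = rsq_by_ss(truth, estimate))

# For poor predictions the traditional form can drop below zero,
# while the squared correlation always stays between 0 and 1
rsq_trad_vec(truth, bad_estimate)
rsq_vec(truth, bad_estimate)
```

For an lm fit with an intercept, the two definitions coincide on the training data, which is why this difference does not explain the gap seen in the question.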

Posted by huangapple on 2023-05-22 23:10:06. Link to this post: https://go.coder-hub.com/76307579.html