What's wrong with the piecewise fitting?

Question


I am new to R and have a question.

Here is the data:

    year <- c(2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022)
    score <- c(85, 85, 88, 88, 94, 94, 94, 82, 82, 84, 84, 84, 84, 84, 84)

I want to use 2015 as the breakpoint for a piecewise linear fit (2008-2015 and 2015-2022). I tried the following code and got the result below. However, I don't think the result is correct, especially for stage 2, which should show an increasing trend.

    stage1 <- year - 2008                     # years since 2008
    stage2 <- (year - 2015) * (year >= 2015)  # years since 2015, 0 before 2015
    fm <- lm(score ~ stage1 + stage2)         # stage2 coefficient = change in slope after 2015
    summary(fm)

    library(car)
    # test whether the post-2015 slope (stage1 + stage2) is zero
    linearHypothesis(fm, "stage1 + stage2", verbose = TRUE)

    plot(score ~ year)
    lines(fitted(fm) ~ year, col = "red")
    abline(v = 2015, lty = 2)

[Figure: the piecewise linear fitting result]

Answer 1

Score: 2


The main problem is that the model is inappropriate for this data. It suited the data in the Stack Overflow post it was taken from (https://stackoverflow.com/questions/76480532/how-can-i-set-the-breakpoints-myself-to-do-the-piecewise-linear-fitting-with-man/76482328#76482328), but it forces the two line segments to meet at 2015, whereas here the score drops from 94 in 2014 to 82 in 2015. In this case a model consisting of discontinuous rather than continuous line segments seems more appropriate.

    stage1.slope <- (year < 2015) * (year - 2015)
    stage1.icept <- +(year < 2015)
    stage2.slope <- (year >= 2015) * (year - 2015)
    stage2.icept <- +(year >= 2015)
    fm <- lm(score ~ stage1.icept + stage1.slope + stage2.icept + stage2.slope + 0)
    summary(fm)
    ## Call:
    ## lm(formula = score ~ stage1.icept + stage1.slope + stage2.icept +
    ##     stage2.slope + 0)
    ##
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max
    ## -1.71429 -0.64286  0.07143  0.64286  2.46429
    ##
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)
    ## stage1.icept  97.0000     0.9904  97.936  < 2e-16 ***
    ## stage1.slope   1.8214     0.2215   8.224 5.02e-06 ***
    ## stage2.icept  82.5000     0.7565 109.060  < 2e-16 ***
    ## stage2.slope   0.2857     0.1808   1.580    0.142
    ## ---
    ## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    ##
    ## Residual standard error: 1.172 on 11 degrees of freedom
    ## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9998
    ## F-statistic: 2.043e+04 on 4 and 11 DF,  p-value: < 2.2e-16
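One way to check whether the discontinuity is really needed is to write the same model with an explicit jump term and compare it against the continuous version. The following is only a sketch: the names `x`, `hinge`, `jump`, `fm_cont` and `fm_disc` are made up for illustration and do not appear in the original answer.

```r
year  <- 2008:2022
score <- c(85, 85, 88, 88, 94, 94, 94, 82, 82, 84, 84, 84, 84, 84, 84)

# Hypothetical helper variables (not from the original answer)
x     <- year - 2015          # years relative to the breakpoint
hinge <- pmax(x, 0)           # carries the change in slope after 2015
jump  <- as.numeric(x >= 0)   # 1 from 2015 onwards: level shift at the break

fm_cont <- lm(score ~ x + hinge)         # segments forced to meet at 2015
fm_disc <- lm(score ~ x + hinge + jump)  # segments may be offset at 2015

coef(fm_disc)["jump"]    # estimated size of the level shift at 2015
anova(fm_cont, fm_disc)  # F test of the continuous against the discontinuous model
```

`fm_disc` spans the same model space as the four-term model above, so the `jump` coefficient should equal stage2.icept minus stage1.icept, and the `anova()` row tests whether allowing that jump improves the fit.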

Alternatively, given that stage2.slope is not significant (p = 0.142), we could consider dropping that term. The fm2 <- line can optionally be replaced with the equivalent commented-out update() line.

    # fm2 <- update(fm, . ~ . - stage2.slope)
    fm2 <- lm(score ~ stage1.icept + stage1.slope + stage2.icept + 0)
    summary(fm2)
    ## Call:
    ## lm(formula = score ~ stage1.icept + stage1.slope + stage2.icept + 0)
    ##
    ## Residuals:
    ##    Min     1Q Median     3Q    Max
    ## -1.714 -1.125  0.500  0.500  2.464
    ##
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)
    ## stage1.icept  97.0000     1.0504  92.347  < 2e-16 ***
    ## stage1.slope   1.8214     0.2349   7.755 5.16e-06 ***
    ## stage2.icept  83.5000     0.4394 190.028  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    ##
    ## Residual standard error: 1.243 on 12 degrees of freedom
    ## Multiple R-squared:  0.9998, Adjusted R-squared:  0.9998
    ## F-statistic: 2.422e+04 on 3 and 12 DF,  p-value: < 2.2e-16

    plot(score ~ year)
    lines(fitted(fm) ~ year, col = "red")
    lines(fitted(fm2) ~ year, col = "blue", lty = 2, lwd = 2)
    legend("topright", c("fm", "fm2"), col = c("red", "blue"), lty = 1:2, lwd = 1:2)
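Because lines() connects consecutive points, the plot above joins the 2014 and 2015 fitted values with a steep connecting stroke that is not part of the model. A possible variation, assuming the same four-term model as fm, draws each stage separately; the index vectors `pre` and `post` are illustrative names, not from the original answer.

```r
year  <- 2008:2022
score <- c(85, 85, 88, 88, 94, 94, 94, 82, 82, 84, 84, 84, 84, 84, 84)

# Same discontinuous model as fm in the answer
stage1.icept <- +(year < 2015)
stage1.slope <- (year < 2015) * (year - 2015)
stage2.icept <- +(year >= 2015)
stage2.slope <- (year >= 2015) * (year - 2015)
fm <- lm(score ~ stage1.icept + stage1.slope + stage2.icept + stage2.slope + 0)

pre  <- year < 2015    # observations in stage 1 (hypothetical name)
post <- year >= 2015   # observations in stage 2 (hypothetical name)

plot(score ~ year)
lines(year[pre],  fitted(fm)[pre],  col = "red")  # stage 1 segment only
lines(year[post], fitted(fm)[post], col = "red")  # stage 2 segment only
abline(v = 2015, lty = 2)                         # mark the breakpoint
```

Drawn this way, the gap between the two segments at 2015 shows the estimated discontinuity directly.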


Answer 2

Score: 0


Given your data as a data frame d:

    d <- data.frame(
      year = c(2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022),
      score = c(85, 85, 88, 88, 94, 94, 94, 82, 82, 84, 84, 84, 84, 84, 84)
    )

  • With base R, you could plot the whole data set, then split the data at the desired point and Map the resulting list of data frames to a list of appropriate objects (here, the ablines for the coefficients of separate models), adding the ablines as you go along:

    plot(score ~ year, data = d)
    d |>
      split(list(d$year >= 2015)) |>
      Map(f = \(chunk) abline(coef(lm(score ~ year, data = chunk))))


  • Alternatively, you could use the split criterion as the grouping variable for groupwise linear smoothing in ggplot:

    library(ggplot2)
    d |>
      ggplot(aes(year, score, group = year >= 2015)) +
      geom_point() +
      geom_smooth(method = 'lm',
                  se = FALSE ## hide confidence bands
      )
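The base R split approach above draws each abline across the whole plot and discards the fitted models. A possible variation, sketched here with illustrative names (`chunks`, `fits`, `slopes`), keeps the per-stage models so their slopes can be inspected and draws each line only over its own years:

```r
d <- data.frame(
  year  = 2008:2022,
  score = c(85, 85, 88, 88, 94, 94, 94, 82, 82, 84, 84, 84, 84, 84, 84)
)

# One separate regression on each side of the breakpoint
chunks <- split(d, d$year >= 2015)   # list with elements "FALSE" and "TRUE"
fits   <- lapply(chunks, function(chunk) lm(score ~ year, data = chunk))

# Slope of each stage (score change per year)
slopes <- sapply(fits, function(fit) coef(fit)[["year"]])
slopes

plot(score ~ year, data = d)
for (nm in names(fits)) {
  # draw each fitted line only over the years of its own chunk
  lines(chunks[[nm]]$year, fitted(fits[[nm]]), col = "red")
}
```

The two slopes should match stage1.slope and stage2.slope from the single discontinuous model in the other answer, since fitting separate regressions per stage gives the same lines.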


  • Posted by huangapple on 2023-07-03 10:53:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76601598.html