英文:
What's wrong with the piecewise fitting
问题
以下是您提供的代码的翻译部分:
我是R的新手。我想问一个问题:
这是数据:
year <- c(2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022)
score <- c(85, 85, 88, 88, 94, 94, 94, 82, 82, 84, 84, 84, 84, 84, 84)
我想将2015年设置为分段线性拟合的分界点(2008-2015和2015-2022)。我尝试了以下代码,得到了下面的结果。然而,我认为结果不正确,特别是第二阶段,它应该是一个增长趋势。
stage1 <- year - 2008
stage2 <- (year - 2015) * (year >= 2015)
fm <- lm(score ~ stage1 + stage2)
summary(fm)
library(car)
linearHypothesis(fm, "stage1 + stage2", verbose = TRUE)
plot(score ~ year)
lines(fitted(fm) ~ year, col = "red")
abline(v = 2015, lty = 2)
分段线性拟合结果
![分段线性拟合结果][1]
请注意,代码部分未被翻译,只翻译了您提供的文本信息。
英文:
I am new to R. I want to ask a question below:
Here is the data:
year <- c(2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022)
score <- c(85, 85, 88, 88, 94, 94, 94, 82, 82, 84, 84, 84, 84, 84, 84)
I want to set 2015 as the breakpoint to do the piecewise linear fitting (2008-2015 and 2015-2022). I have tried the following code, it get the results below. However, I think the result is not correct, especially for stage2, which shoube an increasing trend.
stage1 <- year - 2008
stage2 <- (year - 2015) * (year >= 2015)
fm <- lm(score~ stage1 + stage2)
summary(fm)
library(car)
linearHypothesis(fm, "stage1 + stage2", verbose = TRUE)
plot(score ~ year)
lines(fitted(fm) ~ year, col = "red")
abline(v = 2015, lty = 2)
The piecewise linaer fitting result
答案1
得分: 2
以下是翻译好的内容:
主要问题是使用了不合适的模型。该模型描述了从Stack Overflow帖子中获取的数据(https://stackoverflow.com/questions/76480532/how-can-i-set-the-breakpoints-myself-to-do-the-piecewise-linear-fitting-with-man/76482328#76482328),但不适用于这份数据。在这种情况下,由不连续而不是连续线段组成的模型似乎更合适。
stage1.slope <- (year < 2015) * (year - 2015)
stage1.icept <- +(year < 2015)
stage2.slope <- (year >= 2015) * (year - 2015)
stage2.icept <- +(year >= 2015)
fm <- lm(score ~ stage1.icept + stage1.slope + stage2.icept + stage2.slope + 0)
summary(fm)
## Call:
## lm(formula = score ~ stage1.icept + stage1.slope + stage2.icept +
## stage2.slope + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.71429 -0.64286 0.07143 0.64286 2.46429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## stage1.icept 97.0000 0.9904 97.936 < 2e-16 ***
## stage1.slope 1.8214 0.2215 8.224 5.02e-06 ***
## stage2.icept 82.5000 0.7565 109.060 < 2e-16 ***
## stage2.slope 0.2857 0.1808 1.580 0.142
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.172 on 11 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9998
## F-statistic: 2.043e+04 on 4 and 11 DF, p-value: < 2.2e-16
或者,考虑到stage2.slope
不显著,我们可以考虑删除该项。我们可以选择将fm2<-
行替换为等效的已注释行。
# fm2 <- update(fm, . ~ . - stage2.slope)
fm2 <- lm(score ~ stage1.icept + stage1.slope + stage2.icept + 0)
summary(fm2)
## Call:
## lm(formula = score ~ stage1.icept + stage1.slope + stage2.icept + 0
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.714 -1.125 0.500 0.500 2.464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## stage1.icept 97.0000 1.0504 92.347 < 2e-16 ***
## stage1.slope 1.8214 0.2349 7.755 5.16e-06 ***
## stage2.icept 83.5000 0.4394 190.028 < 2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.243 on 12 degrees of freedom
## Multiple R-squared: 0.9998, Adjusted R-squared: 0.9998
## F-statistic: 2.422e+04 on 3 and 12 DF, p-value: < 2.2e-16
绘制图形并添加图例:
plot(score ~ year)
lines(fitted(fm) ~ year, col = "red")
lines(fitted(fm2) ~ year, col = "blue", lty = 2, lwd = 2)
legend("topright", c("fm", "fm2"), col = c("red", "blue"), lty = 1:2, lwd = 1:2)
希望这对你有所帮助!
英文:
The main problem is using an inappropriate model. The model described the data in the SO post it was taken from (https://stackoverflow.com/questions/76480532/how-can-i-set-the-breakpoints-myself-to-do-the-piecewise-linear-fitting-with-man/76482328#76482328) but not this data. In this case a model consisting of discontinuous rather than continuous line segments seems more appropriate.
stage1.slope <- (year < 2015) * (year - 2015)
stage1.icept <- +(year < 2015)
stage2.slope <- (year >= 2015) * (year - 2015)
stage2.icept <- +(year >= 2015)
fm <- lm(score ~ stage1.icept + stage1.slope + stage2.icept + stage2.slope + 0)
summary(fm)
## Call:
## lm(formula = score ~ stage1.icept + stage1.slope + stage2.icept +
## stage2.slope + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.71429 -0.64286 0.07143 0.64286 2.46429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## stage1.icept 97.0000 0.9904 97.936 < 2e-16 ***
## stage1.slope 1.8214 0.2215 8.224 5.02e-06 ***
## stage2.icept 82.5000 0.7565 109.060 < 2e-16 ***
## stage2.slope 0.2857 0.1808 1.580 0.142
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.172 on 11 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9998
## F-statistic: 2.043e+04 on 4 and 11 DF, p-value: < 2.2e-16
or given that stage2.slope
is not significant we could consider dropping that term. We can optionally replace the fm2<- line with the equivalent commented out line.
# fm2 <- update(fm, . ~ . - stage2.slope)
fm2 <- lm(score ~ stage1.icept + stage1.slope + stage2.icept + 0)
summary(fm2)
## Call:
## lm(formula = score ~ stage1.icept + stage1.slope + stage2.icept + 0
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.714 -1.125 0.500 0.500 2.464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## stage1.icept 97.0000 1.0504 92.347 < 2e-16 ***
## stage1.slope 1.8214 0.2349 7.755 5.16e-06 ***
## stage2.icept 83.5000 0.4394 190.028 < 2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.243 on 12 degrees of freedom
## Multiple R-squared: 0.9998, Adjusted R-squared: 0.9998
## F-statistic: 2.422e+04 on 3 and 12 DF, p-value: < 2.2e-16
plot(score ~ year)
lines(fitted(fm) ~ year, col = "red")
lines(fitted(fm2) ~ year, col = "blue", lty = 2, lwd = 2)
legend("topright", c("fm", "fm2"), col = c("red", "blue"), lty = 1:2, lwd = 1:2)
答案2
得分: 0
给定你的数据框为 d:
d <-
data.frame(
year = c(2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022),
score = c(85, 85, 88, 88, 94, 94, 94, 82, 82, 84, 84, 84, 84, 84, 84)
)
- 通过基础R,你可以绘制整个数据集,然后在所需的点处拆分数据,并将结果的数据帧列表映射到适当的对象列表(这里是
ablines
用于单独模型的系数)。在进行操作时添加ablines:
plot(score ~ year, data = d)
d %>%
split(list(d$year >= 2015)) %>%
Map(f = \(chunk) abline(coef(lm(score ~ year, data = chunk))))
- 或者,你可以在
ggplot
中使用分组线性平滑的分组标准:
library(ggplot2)
d %>%
ggplot(aes(year, score, group = year >= 2015)) +
geom_point() +
geom_smooth(method = 'lm',
se = FALSE ## 隐藏置信区间
)
英文:
given your data as dataframe d:
d <-
data.frame(
year = c(2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022),
score = c(85, 85, 88, 88, 94, 94, 94, 82, 82, 84, 84, 84, 84, 84, 84)
)
- with base R, you could plot the whole set, then
split
the data at the desired point, andMap
the resulting list of dataframes to a list of appropriate objects (here: theablines
for thecoef
ficients of separate models). Add the ablines as you go along:
plot(score ~ year, data = d)
d |>
split(list(d$year >= 2015)) |>
Map(f = \(chunk) abline(coef(lm(score ~ year, data = chunk))))
- alternative, you could use the split criterion for groupwise linear smoothing in
ggplot
:
library(ggplot2)
d |>
ggplot(aes(year, score, group = year >= 2015)) +
geom_point() +
geom_smooth(method = 'lm',
se = FALSE ## hide confidence bands
)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论