Linear model does not produce a slope or R-squared when the independent and dependent variables are the same

Question

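I have a data frame and am running a linear regression. When I use the same variable as both the independent and the dependent variable, the model summary does not return the slope and R-squared of 1 that I expected; it reports only the intercept. Why are a slope and an R-squared of 1 not returned when the independent and dependent variables are the same?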

aa <- data.frame(x = rnorm(10, 100, 5),
                 y = rnorm(10,500, 2))

lm_mod1 <- lm(y~x, data = aa)
summary(lm_mod1) # works as expected, returning a slope and R-squared value
#> 
#> Call:
#> lm(formula = y ~ x, data = aa)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.0241 -1.3874  0.5264  1.7933  2.2946 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 489.7413    14.2402  34.391 5.59e-10 ***
#> x             0.1008     0.1428   0.706      0.5    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.491 on 8 degrees of freedom
#> Multiple R-squared:  0.05862,    Adjusted R-squared:  -0.05905 
#> F-statistic: 0.4982 on 1 and 8 DF,  p-value: 0.5003

lm_mod2 <- lm(x~x, data = aa)
#> Warning in model.matrix.default(mt, mf, contrasts): the response appeared on the
#> right-hand side and was dropped
#> Warning in model.matrix.default(mt, mf, contrasts): problem with term 1 in
#> model.matrix: no columns are assigned
summary(lm_mod2) # does not return a slope or R-squared value
#> 
#> Call:
#> lm(formula = x ~ x, data = aa)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -10.6993  -2.9903  -0.6496   3.2495   8.4294 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   99.554      1.838   54.16 1.25e-12 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 5.813 on 9 degrees of freedom

Created on 2023-08-08 with reprex v2.0.2

Answer 1

Score: 2

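The warning tells you why:
> the response appeared on the right-hand side and was dropped

So your formula effectively becomes x ~ 1 (an estimate of the mean of x only).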
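
One way to check this claim is to fit both formulas and compare the coefficients. The following is a minimal sketch, not part of the original answer; the seed and the names `fit_self` and `fit_mean` are assumptions chosen for illustration, and `suppressWarnings()` is used only to silence the warnings quoted above.

```r
# Hypothetical seed, used only so the sketch is reproducible
set.seed(1)
aa <- data.frame(x = rnorm(10, 100, 5))

# x ~ x: the response is dropped from the right-hand side (with warnings)
fit_self <- suppressWarnings(lm(x ~ x, data = aa))

# x ~ 1: an explicit intercept-only model
fit_mean <- lm(x ~ 1, data = aa)

# Both fits should contain a single coefficient, the intercept
names(coef(fit_self))
names(coef(fit_mean))

# The intercepts should agree with each other and with the sample mean of x
all.equal(coef(fit_self), coef(fit_mean))
all.equal(unname(coef(fit_self)), mean(aa$x))
```

If both `all.equal()` calls return `TRUE`, the model fitted from `x ~ x` is just the intercept-only model, whose intercept is the mean of x.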

This is done on purpose. If you want to circumvent it, you can do:

aa$z <- aa$x

summary(lm(x ~ z, data = aa))
#> Call:
#> lm(formula = x ~ z, data = aa)
#> 
#> Coefficients:
#> (Intercept)            z  
#>  -1.798e-14    1.000e+00  
#> 
#> > summary(lm(x ~ z, data = aa))
#> 
#> Call:
#> lm(formula = x ~ z, data = aa)
#> 
#> Residuals:
#>        Min         1Q     Median         3Q        Max 
#> -1.582e-15 -4.789e-16 -2.258e-16  6.682e-16  1.553e-15 
#> 
#> Coefficients:
#>               Estimate Std. Error    t value Pr(>|t|)    
#> (Intercept) -1.798e-14  6.334e-15 -2.838e+00   0.0219 *  
#> z            1.000e+00  6.534e-17  1.530e+16   <2e-16 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> Residual standard error: 9.623e-16 on 8 degrees of freedom
#> Multiple R-squared:      1,	Adjusted R-squared:      1 
#> F-statistic: 2.342e+32 on 1 and 8 DF,  p-value: < 2.2e-16
#> 
#> Warning message:
#> In summary.lm(lm(x ~ z, data = aa)) :
#>   essentially perfect fit: summary may be unreliable

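You will see that you do indeed get a slope of 1 and an R-squared of 1, along with a warning that the fit is essentially perfect and the summary may therefore be unreliable.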
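
The same result can be read off programmatically rather than from the printed summary. This is a small sketch under assumptions: the seed and the names `fit`, `slope` and `r2` are illustrative, and `suppressWarnings()` is there only because `summary()` may emit the "essentially perfect fit" warning.

```r
# Hypothetical seed, used only so the sketch is reproducible
set.seed(42)
aa <- data.frame(x = rnorm(10, 100, 5))

# Copy x under a different name so it is not recognised as the response
aa$z <- aa$x
fit <- lm(x ~ z, data = aa)

# Extract the slope and R-squared as plain numbers
slope <- unname(coef(fit)["z"])
r2 <- suppressWarnings(summary(fit)$r.squared)

# Compare with 1 up to floating-point tolerance instead of using ==
all.equal(slope, 1)
all.equal(r2, 1)
```

Comparing with `all.equal()` rather than `==` matters here because the estimates are only equal to 1 up to rounding error, as the tiny residuals in the output above suggest.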

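This is a feature, not a bug; variables on the right-hand side that also appear on the left are actively sought out and dropped, and it's not clear why you would want to do this anyway.

Curiosity?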
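
To illustrate the dropping behaviour described above, the sketch below puts the response on the right-hand side next to a second predictor. It is an illustration under assumptions, not part of the original answer: the seed and the name `fit_xy` are hypothetical, and the data mirror the question's `aa`.

```r
# Hypothetical seed, used only so the sketch is reproducible
set.seed(123)
aa <- data.frame(x = rnorm(10, 100, 5),
                 y = rnorm(10, 500, 2))

# x appears on both sides; only the right-hand-side copy of x is dropped
fit_xy <- suppressWarnings(lm(x ~ x + y, data = aa))

# The remaining coefficients should be the intercept and y
names(coef(fit_xy))

# The design matrix should have no column for x
colnames(suppressWarnings(model.matrix(fit_xy)))
```

Only the term identical to the response is removed; other predictors in the formula are kept, so the fit behaves like `x ~ y`.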

huangapple
  • Posted on 2023-08-09 00:52:13
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76861671.html