Regression with dummy variable multiplied by another
# Question
What is the difference between approach 1 and approach 2 below?

I was thinking that `I()` would let us multiply the two variables without adding an interaction term, but here it is not working as expected. Do I understand correctly that the second approach also takes into account the three 0s (the non-USA rows)? So the model is built on 6 points instead of 3 - can we somehow fix it?
``` r
library(dplyr)

df <- data.frame(
  Salary    = c(5, 1:2, 4, 1:2),
  Variable1 = c(500, 490, 501, 460, 490, 505),
  Variable2 = c(5, 10, 0, 3, 17, 40),
  Country   = c(rep("USA", 3), rep("RPA", 3)),
  Dummy_USA = c(rep(1, 3), rep(0, 3))
)

# Approach 1
summary(lm(Salary ~ Variable1, df %>% filter(Country == "USA")))

# Approach 2
summary(lm(Salary ~ I(Variable1 * Dummy_USA), df))
```
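For what it's worth, a quick `nobs()` check (a small sketch added for illustration) seems to confirm the six-versus-three suspicion:

``` r
# Illustrative check: how many rows does each fit actually use?
fit1 <- lm(Salary ~ Variable1, df %>% filter(Country == "USA"))
fit2 <- lm(Salary ~ I(Variable1 * Dummy_USA), df)
nobs(fit1)  # 3
nobs(fit2)  # 6 -- the non-USA rows enter the second fit as zeros
```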
# Answer 1
**Score**: 2
Yes, the second version simply regresses the vector `c(5, 1, 2, 4, 1, 2)` on the vector `c(500, 490, 501, 0, 0, 0)`. This is very different from the first version, which regresses the vector `c(5, 1, 2)` on the vector `c(500, 490, 501)`.

If you want to use the dummy variable, you can pass it to either the `subset` argument or the `weights` argument of `lm`.
``` r
with(df, summary(lm(Salary ~ Variable1, subset = Dummy_USA == 1)))
#>
#> Call:
#> lm(formula = Salary ~ Variable1, subset = Dummy_USA == 1)
#>
#> Residuals:
#> 1 2 3
#> 1.6847 -0.1532 -1.5315
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -104.7928 131.8453 -0.795 0.572
#> Variable1 0.2162 0.2653 0.815 0.565
#>
#> Residual standard error: 2.282 on 1 degrees of freedom
#> Multiple R-squared: 0.3992, Adjusted R-squared: -0.2017
#> F-statistic: 0.6644 on 1 and 1 DF, p-value: 0.5646
```

or

``` r
with(df, summary(lm(Salary ~ Variable1, weights = Dummy_USA)))
#>
#> Call:
#> lm(formula = Salary ~ Variable1, weights = Dummy_USA)
#>
#> Weighted Residuals:
#> 1 2 3 4 5 6
#> 1.6847 -0.1532 -1.5315 0.0000 0.0000 0.0000
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -104.7928 131.8453 -0.795 0.572
#> Variable1 0.2162 0.2653 0.815 0.565
#>
#> Residual standard error: 2.282 on 1 degrees of freedom
#> Multiple R-squared: 0.3992, Adjusted R-squared: -0.2017
#> F-statistic: 0.6644 on 1 and 1 DF, p-value: 0.5646
```
<sup>Created on 2023-03-20 with reprex v2.0.2</sup>
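Beyond `subset` and `weights`, a further option (sketched here as an aside, not part of the answer above) is a full interaction model, which fits a separate intercept and slope per country in a single call. The implied USA line reproduces the subset fit's point estimates, although the standard errors differ because the residual variance is pooled across both countries:

``` r
# Sketch: one model with a separate intercept and slope per country.
fit_int <- lm(Salary ~ Dummy_USA * Variable1, df)

# Implied USA line: base terms plus the dummy and interaction terms.
coef(fit_int)[["(Intercept)"]] + coef(fit_int)[["Dummy_USA"]]           # USA intercept, ~ -104.79
coef(fit_int)[["Variable1"]] + coef(fit_int)[["Dummy_USA:Variable1"]]   # USA slope, ~ 0.2162
```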
# Answer 2
**Score**: 2
To make this easier to refer to, let us refer to `Salary` and `Variable1` as y and x respectively, and let b0 and b1 be the intercept and slope. Then the first `lm` does not involve y[4], y[5], y[6], but the second one does.

In particular, the first `lm` minimizes the following over b0 and b1:

```
(y[1] - b0 - b1 * x[1])^2 + (y[2] - b0 - b1 * x[2])^2 + (y[3] - b0 - b1 * x[3])^2
```

whereas the second one minimizes that plus

```
(y[4] - b0)^2 + (y[5] - b0)^2 + (y[6] - b0)^2
```
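To make the two objectives concrete, here is a minimal sketch (an added illustration) that solves the least-squares normal equations for each sum and reproduces the two fits from the question:

``` r
# Minimal sketch: solving the normal equations for each objective
# reproduces the two lm() fits from the question.
y <- df$Salary
x <- df$Variable1 * df$Dummy_USA              # c(500, 490, 501, 0, 0, 0)

X1 <- cbind(1, x[1:3])                        # first objective: USA rows only
solve(crossprod(X1), crossprod(X1, y[1:3]))   # matches Approach 1 (b0, b1)

X2 <- cbind(1, x)                             # combined objective: all six rows
solve(crossprod(X2), crossprod(X2, y))        # matches Approach 2
```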