Regression with dummy variable multiplied by another
Question
What is the difference between approach 1 and approach 2 below?

I was thinking that `I()` would let us multiply the two variables without including an interaction term, but here it is not working as expected. Do I understand correctly that the 2nd approach also takes the three 0 values (the non-USA rows) into account? So the model is built on 6 points instead of 3; can we somehow fix it?
``` r
library(dplyr)  # needed for %>% and filter()

df <- data.frame(
  Salary    = c(5, 1:2, 4, 1:2),
  Variable1 = c(500, 490, 501, 460, 490, 505),
  Variable2 = c(5, 10, 0, 3, 17, 40),
  Country   = c(rep("USA", 3), rep("RPA", 3)),
  Dummy_USA = c(rep(1, 3), rep(0, 3))
)

# Approach 1
summary(lm(Salary ~ Variable1, df %>% filter(Country == "USA")))

# Approach 2
summary(lm(Salary ~ I(Variable1 * Dummy_USA), df))
```
# Answer 1

**Score**: 2
Yes, the second version simply regresses the vector `c(5, 1, 2, 4, 1, 2)` on the vector `c(500, 490, 501, 0, 0, 0)`. This is very different from the first version, which regresses the vector `c(5, 1, 2)` on the vector `c(500, 490, 501)`.

If you want to use the dummy variable, you can pass it to either the `subset` argument or the `weights` argument of `lm`.
``` r
with(df, summary(lm(Salary ~ Variable1, subset = Dummy_USA == 1)))
#>
#> Call:
#> lm(formula = Salary ~ Variable1, subset = Dummy_USA == 1)
#>
#> Residuals:
#> 1 2 3
#> 1.6847 -0.1532 -1.5315
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -104.7928 131.8453 -0.795 0.572
#> Variable1 0.2162 0.2653 0.815 0.565
#>
#> Residual standard error: 2.282 on 1 degrees of freedom
#> Multiple R-squared: 0.3992, Adjusted R-squared: -0.2017
#> F-statistic: 0.6644 on 1 and 1 DF, p-value: 0.5646
```

or

``` r
with(df, summary(lm(Salary ~ Variable1, weights = Dummy_USA)))
#>
#> Call:
#> lm(formula = Salary ~ Variable1, weights = Dummy_USA)
#>
#> Weighted Residuals:
#> 1 2 3 4 5 6
#> 1.6847 -0.1532 -1.5315 0.0000 0.0000 0.0000
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -104.7928 131.8453 -0.795 0.572
#> Variable1 0.2162 0.2653 0.815 0.565
#>
#> Residual standard error: 2.282 on 1 degrees of freedom
#> Multiple R-squared: 0.3992, Adjusted R-squared: -0.2017
#> F-statistic: 0.6644 on 1 and 1 DF, p-value: 0.5646
```
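Both calls fit on the three USA rows only, so their coefficients agree with the filtered fit from Approach 1. A minimal sketch checking this (it restates the question's `df` in base R; the `fit_*` names are just illustrative):

``` r
df <- data.frame(
  Salary    = c(5, 1, 2, 4, 1, 2),
  Variable1 = c(500, 490, 501, 460, 490, 505),
  Variable2 = c(5, 10, 0, 3, 17, 40),
  Country   = c(rep("USA", 3), rep("RPA", 3)),
  Dummy_USA = c(rep(1, 3), rep(0, 3))
)

# Three ways to fit on the USA rows only
fit_filter <- lm(Salary ~ Variable1, df[df$Country == "USA", ])
fit_subset <- lm(Salary ~ Variable1, df, subset = Dummy_USA == 1)
fit_weight <- lm(Salary ~ Variable1, df, weights = Dummy_USA)

all.equal(coef(fit_filter), coef(fit_subset))  # TRUE
all.equal(coef(fit_filter), coef(fit_weight))  # TRUE
```

Rows with zero weight contribute nothing to the least-squares objective, which is why `weights = Dummy_USA` reproduces the subset fit exactly.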
<sup>Created on 2023-03-20 with reprex v2.0.2</sup>
# Answer 2

**Score**: 2
To make this easier to refer to, let us denote Salary and Variable1 by y and x respectively, and let b0 and b1 be the intercept and slope. Then the first lm does not involve y[4], y[5], y[6], but the second one does.

In particular, the first lm minimizes the following over b0 and b1:

(y[1] - b0 - b1 * x[1])^2 + (y[2] - b0 - b1 * x[2])^2 + (y[3] - b0 - b1 * x[3])^2

whereas the second one minimizes that plus

(y[4] - b0)^2 + (y[5] - b0)^2 + (y[6] - b0)^2

(the b1 terms drop out for rows 4 to 6 because x[4] = x[5] = x[6] = 0 after multiplying by the dummy).
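This decomposition can be checked numerically; a small sketch (the vectors below restate the question's data, and `b0`/`b1` are read off the fitted model):

``` r
y <- c(5, 1, 2, 4, 1, 2)        # Salary
x <- c(500, 490, 501, 0, 0, 0)  # Variable1 * Dummy_USA

fit <- lm(y ~ x)
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["x"]]

# USA terms plus non-USA terms (which collapse to (y - b0)^2 since x = 0)
usa     <- sum((y[1:3] - b0 - b1 * x[1:3])^2)
non_usa <- sum((y[4:6] - b0)^2)

all.equal(usa + non_usa, sum(resid(fit)^2))  # TRUE
```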

