Regression with dummy variable multiplied by another

# Question


What is the difference between approach 1 and approach 2 below?
I was thinking that `I()` would let us multiply two variables without including an interaction term, but here it is not working as expected. Do I understand correctly that the second approach also takes the three zeros (the non-USA rows) into account, so the model is built on 6 points instead of 3? Can we somehow fix that?

library(dplyr)  # needed for %>% and filter()

df <- data.frame(
                 Salary=c(5, 1:2, 4, 1:2),
                 Variable1=c(500, 490, 501, 460, 490, 505),
                 Variable2=c(5, 10, 0, 3, 17, 40),
                 Country=c(rep("USA", 3), rep("RPA", 3)),
                 Dummy_USA=c(rep(1, 3), rep(0, 3))
)

# Approach 1
summary(lm(Salary ~ Variable1, df %>% filter(Country == "USA")))

# Approach 2
summary(lm(Salary ~ I(Variable1 * Dummy_USA), df))



# Answer 1
**Score**: 2

Yes, the second version simply regresses the vector `c(5, 1, 2, 4, 1, 2)` on the vector `c(500, 490, 501, 0, 0, 0)`. This is very different from the first version, which regresses the vector `c(5, 1, 2)` on the vector `c(500, 490, 501)`.
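
As a quick cross-check, here is a minimal sketch (not part of the original answer; it re-creates the question's `df` and introduces a helper vector `x2`) confirming that Approach 2 fits exactly that regression:

``` r
# Re-create the question's data
df <- data.frame(
  Salary    = c(5, 1, 2, 4, 1, 2),
  Variable1 = c(500, 490, 501, 460, 490, 505),
  Variable2 = c(5, 10, 0, 3, 17, 40),
  Country   = c(rep("USA", 3), rep("RPA", 3)),
  Dummy_USA = c(rep(1, 3), rep(0, 3))
)

x2 <- with(df, Variable1 * Dummy_USA)  # c(500, 490, 501, 0, 0, 0)

# Approach 2 and a plain regression on x2 use the same design matrix,
# so their coefficients agree (this should return TRUE)
all.equal(unname(coef(lm(Salary ~ I(Variable1 * Dummy_USA), df))),
          unname(coef(lm(Salary ~ x2, df))))
```

In other words, `I()` only controls how the predictor is computed; it does not drop the rows where the product is zero.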

If you want to use a dummy variable, you could pass it to either the `subset` argument or the `weights` argument of `lm`.

``` r
with(df, summary(lm(Salary ~ Variable1, subset = Dummy_USA == 1)))
#> 
#> Call:
#> lm(formula = Salary ~ Variable1, subset = Dummy_USA == 1)
#> 
#> Residuals:
#>       1       2       3 
#>  1.6847 -0.1532 -1.5315 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -104.7928   131.8453  -0.795    0.572
#> Variable1      0.2162     0.2653   0.815    0.565
#> 
#> Residual standard error: 2.282 on 1 degrees of freedom
#> Multiple R-squared:  0.3992, Adjusted R-squared:  -0.2017 
#> F-statistic: 0.6644 on 1 and 1 DF,  p-value: 0.5646
```

or

``` r
with(df, summary(lm(Salary ~ Variable1, weights = Dummy_USA)))
#> 
#> Call:
#> lm(formula = Salary ~ Variable1, weights = Dummy_USA)
#> 
#> Weighted Residuals:
#>       1       2       3       4       5       6 
#>  1.6847 -0.1532 -1.5315  0.0000  0.0000  0.0000 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -104.7928   131.8453  -0.795    0.572
#> Variable1      0.2162     0.2653   0.815    0.565
#> 
#> Residual standard error: 2.282 on 1 degrees of freedom
#> Multiple R-squared:  0.3992, Adjusted R-squared:  -0.2017 
#> F-statistic: 0.6644 on 1 and 1 DF,  p-value: 0.5646
```

<sup>Created on 2023-03-20 with reprex v2.0.2</sup>

# Answer 2

**Score**: 2


To make this easier to refer to, let us refer to Salary and Variable1
as y and x respectively, and let b0 and b1 be the intercept and slope. Then
the first lm does not involve y[4], y[5], or y[6], but the second one does.

In particular, the first lm minimizes the following over b0 and b1:

(y[1] - b0 - b1 * x[1])^2 + (y[2] - b0 - b1 * x[2])^2 + (y[3] - b0 - b1 * x[3])^2

whereas the second one minimizes that plus

(y[4] - b0)^2 + (y[5] - b0)^2 + (y[6] - b0)^2
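
As an illustration (a minimal sketch, not from the original answer; the names `sse1`, `sse2`, `b_a1`, and `b_a2` are introduced here just for the check), the two objective functions can be written out directly and evaluated at the coefficients `lm` returns; each fit minimizes its own sum of squares:

``` r
# The question's data as plain vectors: y = Salary, x = Variable1, d = Dummy_USA
y <- c(5, 1, 2, 4, 1, 2)
x <- c(500, 490, 501, 460, 490, 505)
d <- c(1, 1, 1, 0, 0, 0)

# Objective of the first lm: only the three USA rows enter
sse1 <- function(b) sum((y[1:3] - b[1] - b[2] * x[1:3])^2)
# Objective of the second lm: the same three terms plus (y[i] - b0)^2 for i = 4, 5, 6
sse2 <- function(b) sse1(b) + sum((y[4:6] - b[1])^2)

b_a1 <- coef(lm(y[1:3] ~ x[1:3]))  # Approach 1 estimates (b0, b1)
b_a2 <- coef(lm(y ~ I(x * d)))     # Approach 2 estimates (b0, b1)

# Each set of estimates minimizes its own objective,
# so both comparisons should return TRUE
sse1(b_a1) <= sse1(b_a2)
sse2(b_a2) <= sse2(b_a1)
```

Because the second objective also penalizes (y[i] - b0)^2 for the non-USA rows, its intercept and slope will generally differ from those of Approach 1.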
