Regression with dummy variable multiplied by another
Question
What is the difference between approach 1 and approach 2 below?

I was thinking that `I()` would let us multiply the two variables without including an interaction term, but here it is not working as expected. Do I understand correctly that the 2nd approach also takes the three 0 values (the non-USA rows) into account? So the model is built on 6 points instead of 3; can we somehow fix it?
``` r
library(dplyr)  # needed for %>% and filter()

df <- data.frame(
  Salary    = c(5, 1:2, 4, 1:2),
  Variable1 = c(500, 490, 501, 460, 490, 505),
  Variable2 = c(5, 10, 0, 3, 17, 40),
  Country   = c(rep("USA", 3), rep("RPA", 3)),
  Dummy_USA = c(rep(1, 3), rep(0, 3))
)

# Approach 1
summary(lm(Salary ~ Variable1, df %>% filter(Country == "USA")))

# Approach 2
summary(lm(Salary ~ I(Variable1 * Dummy_USA), df))
```
# Answer 1

**Score**: 2
Yes, the second version simply regresses the vector `c(5, 1, 2, 4, 1, 2)` on the vector `c(500, 490, 501, 0, 0, 0)`. This is very different from the first version, which regresses the vector `c(5, 1, 2)` on the vector `c(500, 490, 501)`.

If you want to use the dummy variable, you can pass it to either the `subset` argument or the `weights` argument of `lm`.
``` r
with(df, summary(lm(Salary ~ Variable1, subset = Dummy_USA == 1)))
#>
#> Call:
#> lm(formula = Salary ~ Variable1, subset = Dummy_USA == 1)
#>
#> Residuals:
#> 1 2 3
#> 1.6847 -0.1532 -1.5315
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -104.7928 131.8453 -0.795 0.572
#> Variable1 0.2162 0.2653 0.815 0.565
#>
#> Residual standard error: 2.282 on 1 degrees of freedom
#> Multiple R-squared: 0.3992, Adjusted R-squared: -0.2017
#> F-statistic: 0.6644 on 1 and 1 DF, p-value: 0.5646
```

or

``` r
with(df, summary(lm(Salary ~ Variable1, weights = Dummy_USA)))
#>
#> Call:
#> lm(formula = Salary ~ Variable1, weights = Dummy_USA)
#>
#> Weighted Residuals:
#> 1 2 3 4 5 6
#> 1.6847 -0.1532 -1.5315 0.0000 0.0000 0.0000
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -104.7928 131.8453 -0.795 0.572
#> Variable1 0.2162 0.2653 0.815 0.565
#>
#> Residual standard error: 2.282 on 1 degrees of freedom
#> Multiple R-squared: 0.3992, Adjusted R-squared: -0.2017
#> F-statistic: 0.6644 on 1 and 1 DF, p-value: 0.5646
```
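Both calls fit on the three USA rows only, so their coefficients agree with the filtered fit from Approach 1. A minimal sketch checking this (it restates the question's `df` in base R; the `fit_*` names are just illustrative):

``` r
df <- data.frame(
  Salary    = c(5, 1, 2, 4, 1, 2),
  Variable1 = c(500, 490, 501, 460, 490, 505),
  Variable2 = c(5, 10, 0, 3, 17, 40),
  Country   = c(rep("USA", 3), rep("RPA", 3)),
  Dummy_USA = c(rep(1, 3), rep(0, 3))
)

# Three ways to fit on the USA rows only
fit_filter <- lm(Salary ~ Variable1, df[df$Country == "USA", ])
fit_subset <- lm(Salary ~ Variable1, df, subset = Dummy_USA == 1)
fit_weight <- lm(Salary ~ Variable1, df, weights = Dummy_USA)

all.equal(coef(fit_filter), coef(fit_subset))  # TRUE
all.equal(coef(fit_filter), coef(fit_weight))  # TRUE
```

Rows with zero weight contribute nothing to the least-squares objective, which is why `weights = Dummy_USA` reproduces the subset fit exactly.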
<sup>Created on 2023-03-20 with reprex v2.0.2</sup>
# Answer 2

**Score**: 2
To make this easier to refer to, let us denote Salary and Variable1 by y and x respectively, and let b0 and b1 be the intercept and slope. Then the first lm does not involve y[4], y[5], y[6], but the second one does.

In particular, the first lm minimizes the following over b0 and b1:

(y[1] - b0 - b1 * x[1])^2 + (y[2] - b0 - b1 * x[2])^2 + (y[3] - b0 - b1 * x[3])^2

whereas the second one minimizes that plus

(y[4] - b0)^2 + (y[5] - b0)^2 + (y[6] - b0)^2

(the b1 terms drop out for rows 4 to 6 because x[4] = x[5] = x[6] = 0 after multiplying by the dummy).
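This decomposition can be checked numerically; a small sketch (the vectors below restate the question's data, and `b0`/`b1` are read off the fitted model):

``` r
y <- c(5, 1, 2, 4, 1, 2)        # Salary
x <- c(500, 490, 501, 0, 0, 0)  # Variable1 * Dummy_USA

fit <- lm(y ~ x)
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["x"]]

# USA terms plus non-USA terms (which collapse to (y - b0)^2 since x = 0)
usa     <- sum((y[1:3] - b0 - b1 * x[1:3])^2)
non_usa <- sum((y[4:6] - b0)^2)

all.equal(usa + non_usa, sum(resid(fit)^2))  # TRUE
```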

