July 10, 2023, 16:03:28


lm with dummies' interactions

Question

I have been using the Prestige dataset from the mdhglm package. I wanted to understand how my model would change if I included interactions involving the dummies for the predictor 'type'. Everything runs, but in the output the 'blue collars' dummy is reported as NA, and because of that so are the interactions between the other predictors and blue collars. I don't have any NAs in the data and my dummies are coded correctly, so I don't understand why. Can you please help me?

Prestige3$professional <- ifelse(Prestige3$type == "prof", 1, 0)
Prestige3$white_collars <- ifelse(Prestige3$type == "wc", 1, 0)
Prestige3$blue_collars <- ifelse(Prestige3$type == "bc", 1, 0)
modello_interazioni <- lm(prestige ~ women * professional + education * professional + income_log * professional +
                            women * white_collars + education * white_collars + income_log * white_collars +
                            women * blue_collars + education * blue_collars + income_log * blue_collars,
                          data = Prestige3)
summary(modello_interazioni)

I tried recreating the dummies because I thought they could be the problem, but they are working. I also checked again for NAs, and there aren't any.

Answer 1

Score: 1

In the output table, you may notice that it says "Coefficients: (4 not defined because of singularities)" just above the table of coefficients.

This can happen for a number of reasons, but usually the cause is collinearity, which leaves some coefficients impossible to estimate. In this case you don't need the dummy variables at all: you can put type into the formula as a categorical variable, for example by wrapping it in C() (if type is already a factor, plain type works too).

model <- lm(prestige ~ women * C(type) +
              education * C(type) +
              income * C(type),
            data = Prestige)
summary(model)

Which then gives this table:

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -5.822e+00  7.311e+00  -0.796  0.42803    
women                  1.343e-01  4.656e-02   2.885  0.00494 ** 
C(type)prof            2.436e+01  1.351e+01   1.803  0.07496 .  
C(type)wc             -2.178e+01  1.727e+01  -1.261  0.21081    
education              1.625e+00  9.163e-01   1.773  0.07971 .  
income                 4.692e-03  6.691e-04   7.013 5.00e-10 ***
women:C(type)prof     -1.601e-01  6.506e-02  -2.460  0.01588 *  
women:C(type)wc        2.893e-02  1.117e-01   0.259  0.79619    
C(type)prof:education  1.512e+00  1.235e+00   1.224  0.22423    
C(type)wc:education    2.123e+00  2.190e+00   0.970  0.33491    
C(type)prof:income    -4.144e-03  7.132e-04  -5.810 1.03e-07 ***
C(type)wc:income      -7.527e-04  1.814e-03  -0.415  0.67924
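
As a rough illustration of why the manual dummies are unnecessary, the sketch below uses a small synthetic data frame (not the Prestige data; the values are made up) and inspects the design matrix that a formula with a factor produces. R drops the first level (here "bc") as the reference category, so only two dummy columns and two interaction columns appear.

```r
# Synthetic stand-in data (assumption: NOT the Prestige data).
d <- data.frame(
  type      = factor(c("bc", "bc", "prof", "prof", "wc", "wc")),
  education = c(8, 9, 14, 15, 11, 12)
)

# model.matrix() builds the same design matrix lm() would use for this formula.
X <- model.matrix(~ education * type, data = d)

# One column per estimable coefficient; "bc" is absorbed into the intercept.
print(colnames(X))
```

Because the reference level is absorbed into the intercept, no column is an exact combination of the others, and every coefficient can be estimated.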

Hope that answers your question.
~R

Answer 2

Score: 0

There are two situations in which coefficients will be NA.

1. You have more predictors than observations, i.e. you cannot estimate all the coefficients. In this situation even the standard errors, t-tests, and p-values will be NA. Half-normal plots are then used to judge the effects.
2. There is complete aliasing.
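
A minimal sketch of the first situation, using made-up numbers: with three observations and an intercept plus four predictors, only three coefficients can be estimated, so lm() leaves the remaining two as NA and no residual degrees of freedom are left.

```r
# Synthetic data (assumption: invented values, 3 observations only).
d_small <- data.frame(
  y  = c(1.0, 2.5, 2.0),
  x1 = c(1, 2, 4),
  x2 = c(3, 1, 2),
  x3 = c(2, 5, 1),
  x4 = c(7, 3, 6)
)

# Five coefficients requested, but only three observations are available.
fit_small <- lm(y ~ x1 + x2 + x3 + x4, data = d_small)

print(coef(fit_small))         # the coefficients for x3 and x4 are NA
print(df.residual(fit_small))  # no residual degrees of freedom remain
```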

In your case, you are experiencing the second situation: either two columns are exactly the same, or one column is an exact linear combination of the others, with no randomness involved. Try the alias function to identify the aliased columns:

alias(modello_interazioni)

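In its output, each row names an aliased variable, and the columns with non-zero entries are the variables it is an exact linear combination of, e.g. blue_collars = (Intercept) - professional - white_collars, because every observation belongs to exactly one of the three categories and the three dummies therefore sum to the intercept column. Due to this perfectly linear relationship, one of the coefficients must be NA.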
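
The same effect can be reproduced on a small synthetic example (invented values, with dummy names borrowed from the question). It is only a sketch: it checks that the three dummies add up to the intercept column, fits the model with all three, and then asks alias() which term is redundant.

```r
# Synthetic stand-in data (assumption: NOT the Prestige data).
d <- data.frame(
  type = factor(rep(c("bc", "prof", "wc"), each = 4)),
  y    = c(20, 22, 25, 23, 60, 65, 70, 68, 40, 42, 45, 41)
)
d$professional  <- as.numeric(d$type == "prof")
d$white_collars <- as.numeric(d$type == "wc")
d$blue_collars  <- as.numeric(d$type == "bc")

# Every row falls in exactly one category, so the dummies sum to 1,
# which is exactly the intercept column.
dummies_sum_to_one <- all(d$professional + d$white_collars + d$blue_collars == 1)
print(dummies_sum_to_one)

# Including all three dummies plus an intercept makes the last one redundant.
fit <- lm(y ~ professional + white_collars + blue_collars, data = d)
print(coef(fit))             # blue_collars is NA
print(alias(fit)$Complete)   # expresses blue_collars via the other columns
```

Dropping any one of the three dummies (or the intercept) removes the redundancy.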

As a side note, consider fitting the model as:

summary(lm(prestige ~ (women + education + income_log) * type, data = Prestige3))
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -172.83613   26.17288  -6.604 3.17e-09 ***
women                  0.14059    0.04758   2.955 0.004033 ** 
education              2.42215    0.88082   2.750 0.007266 ** 
income_log            21.78584    3.15780   6.899 8.38e-10 ***
typeprof             147.25606   38.83048   3.792 0.000277 ***
typewc               -24.50672   68.08447  -0.360 0.719770    
women:typeprof        -0.16678    0.06888  -2.421 0.017561 *  
women:typewc           0.05693    0.11155   0.510 0.611098    
education:typeprof     0.68858    1.23286   0.559 0.577937    
education:typewc       0.83715    2.17074   0.386 0.700706    
income_log:typeprof  -16.29484    4.55783  -3.575 0.000577 ***
income_log:typewc      1.06471    8.95592   0.119 0.905645

This gives the result you want. There is no need to create the dummy variables manually unless you are implementing linear regression from scratch.

summary(modello_interazioni)
Coefficients: (4 not defined because of singularities)
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)              -172.83613   26.17288  -6.604 3.17e-09 ***
women                       0.14059    0.04758   2.955 0.004033 ** 
professional              147.25606   38.83048   3.792 0.000277 ***
education                   2.42215    0.88082   2.750 0.007266 ** 
income_log                 21.78584    3.15780   6.899 8.38e-10 ***
white_collars             -24.50672   68.08447  -0.360 0.719770    
blue_collars                     NA         NA      NA       NA    
women:professional         -0.16678    0.06888  -2.421 0.017561 *  
professional:education      0.68858    1.23286   0.559 0.577937    
professional:income_log   -16.29484    4.55783  -3.575 0.000577 ***
women:white_collars         0.05693    0.11155   0.510 0.611098    
education:white_collars     0.83715    2.17074   0.386 0.700706    
income_log:white_collars    1.06471    8.95592   0.119 0.905645    
women:blue_collars               NA         NA      NA       NA    
education:blue_collars           NA         NA      NA       NA    
income_log:blue_collars          NA         NA      NA       NA
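
The two summaries report the same estimates for every coefficient that is defined. A small sketch on synthetic data (invented values, not the Prestige data) suggests why: a factor in the formula and hand-made dummies with one level left out describe the same model, so the fits coincide.

```r
# Synthetic stand-in data (assumption: NOT the Prestige data).
d <- data.frame(
  type      = factor(rep(c("bc", "prof", "wc"), each = 4)),
  education = c(8, 9, 10, 7, 14, 16, 15, 13, 11, 12, 10, 13),
  y         = c(20, 24, 27, 18, 62, 70, 66, 60, 41, 44, 38, 47)
)

# Manual dummies, leaving "bc" out as the reference category.
d$professional  <- as.numeric(d$type == "prof")
d$white_collars <- as.numeric(d$type == "wc")

fit_factor <- lm(y ~ education * type, data = d)
fit_dummy  <- lm(y ~ education * professional + education * white_collars, data = d)

# Same fitted values and the same estimates, only the coefficient names differ.
same_fit  <- isTRUE(all.equal(fitted(fit_factor), fitted(fit_dummy)))
same_coef <- isTRUE(all.equal(unname(coef(fit_factor)), unname(coef(fit_dummy))))
print(c(same_fit = same_fit, same_coef = same_coef))
```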
