lm with interactions between dummy variables

Question

I have been using the Prestige dataset from the mdhglm package. I wanted to understand how my model would change if I included interactions between the other predictors and dummy variables built from the predictor 'type'. Everything runs, but in the output the 'blue collars' dummy is reported as NA, and so are all of the interactions between the other predictors and blue collars. I don't have any NA values in the data and the dummy itself is coded correctly, so I don't understand where the NAs come from. Can you please help me?

Prestige3$professional <- ifelse(Prestige3$type == "prof", 1, 0)
Prestige3$white_collars <- ifelse(Prestige3$type == "wc", 1, 0)
Prestige3$blue_collars <- ifelse(Prestige3$type == "bc", 1, 0)

modello_interazioni <- lm(prestige ~ women * professional + education * professional + income_log * professional +
                            women * white_collars + education * white_collars + income_log * white_collars +
                            women * blue_collars + education * blue_collars + income_log * blue_collars,
                          data = Prestige3)

summary(modello_interazioni)

I have tried creating the dummies again because I thought they could be the problem, but they are coded correctly. I have also checked again for NAs, and there are none.

Answer 1

Score: 1

In the output, you may notice the line "Coefficients: (4 not defined because of singularities)" just above the table of coefficients.

This can happen for a number of reasons. Usually the cause is collinearity, which leaves the model unable to estimate some of the coefficients. In this case you don't need the dummy variables at all: you can treat type as a categorical variable directly in the formula by wrapping it in C().

model <- lm(prestige ~ women * C(type) +
              education * C(type) +
              income * C(type),
            data = Prestige)
summary(model)

Which then gives this table:

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -5.822e+00  7.311e+00  -0.796  0.42803    
women                  1.343e-01  4.656e-02   2.885  0.00494 ** 
C(type)prof            2.436e+01  1.351e+01   1.803  0.07496 .  
C(type)wc             -2.178e+01  1.727e+01  -1.261  0.21081    
education              1.625e+00  9.163e-01   1.773  0.07971 .  
income                 4.692e-03  6.691e-04   7.013 5.00e-10 ***
women:C(type)prof     -1.601e-01  6.506e-02  -2.460  0.01588 *  
women:C(type)wc        2.893e-02  1.117e-01   0.259  0.79619    
C(type)prof:education  1.512e+00  1.235e+00   1.224  0.22423    
C(type)wc:education    2.123e+00  2.190e+00   0.970  0.33491    
C(type)prof:income    -4.144e-03  7.132e-04  -5.810 1.03e-07 ***
C(type)wc:income      -7.527e-04  1.814e-03  -0.415  0.67924 

Hope that answers your question.
~R

Answer 2

Score: 0

There are two situations in which coefficients will be NA.

  • When you have more predictors than observations, i.e. you are unable to estimate all of the coefficients. In this situation even the standard errors will be NA, and the t-tests and p-values will all be NA. You then use half-normal plots to assess the effects.

  • When there is complete aliasing.

In your case, you are experiencing the second situation: two columns are exactly the same, or one column is an exact linear combination of the others, with no randomness. Try using the alias function to identify the linearly dependent columns:

alias(modello_interazioni)

In the output, each row is an aliased term, and the columns with non-zero entries are the terms it is completely aliased with, e.g. blue_collars = Intercept - professional - white_collars, because the three dummies sum to 1 in every row. Due to this perfect linear relationship, one of the coefficients must be NA.
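The dependency can be reproduced on a small simulated data set. This is only a minimal sketch: the data frame d and its columns x and y are made up for illustration and merely mimic the structure of the dummies in the question.

```r
# Simulated data (hypothetical; this is not the Prestige data)
set.seed(1)
d <- data.frame(
  type = factor(rep(c("bc", "prof", "wc"), each = 10)),
  x    = rnorm(30)
)
d$y <- 2 + 0.5 * d$x + rnorm(30)

# Hand-made dummies, built the same way as in the question
d$professional  <- ifelse(d$type == "prof", 1, 0)
d$white_collars <- ifelse(d$type == "wc", 1, 0)
d$blue_collars  <- ifelse(d$type == "bc", 1, 0)

# The three dummies sum to 1 in every row, i.e. to the intercept column
dummy_sum <- d$professional + d$white_collars + d$blue_collars
all(dummy_sum == 1)

# An intercept plus all three dummies makes the design matrix rank-deficient,
# so lm() drops the last redundant terms and reports them as NA
fit <- lm(y ~ x * professional + x * white_collars + x * blue_collars, data = d)
coef(fit)[c("blue_collars", "x:blue_collars")]

# alias() reports the exact linear dependencies
alias(fit)$Complete
```

Dropping any one of the three dummies from the formula, or removing the intercept, makes the remaining columns linearly independent again.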


Note that you should consider running your code as:

summary(lm(prestige ~ (women + education + income_log) * type, data = Prestige3))

                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -172.83613   26.17288  -6.604 3.17e-09 ***
women                  0.14059    0.04758   2.955 0.004033 ** 
education              2.42215    0.88082   2.750 0.007266 ** 
income_log            21.78584    3.15780   6.899 8.38e-10 ***
typeprof             147.25606   38.83048   3.792 0.000277 ***
typewc               -24.50672   68.08447  -0.360 0.719770    
women:typeprof        -0.16678    0.06888  -2.421 0.017561 *  
women:typewc           0.05693    0.11155   0.510 0.611098    
education:typeprof     0.68858    1.23286   0.559 0.577937    
education:typewc       0.83715    2.17074   0.386 0.700706    
income_log:typeprof  -16.29484    4.55783  -3.575 0.000577 ***
income_log:typewc      1.06471    8.95592   0.119 0.905645  

This gives the result you want. There is no need to create the dummy variables manually unless you are implementing linear regression from scratch.
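To see the dummy coding that lm() builds from a factor, model.matrix() can be called on its own. The sketch below uses a short made-up factor named type as a stand-in for the column in the question; with R's default treatment contrasts, a factor with k levels yields k - 1 dummy columns next to the intercept.

```r
# Hypothetical three-level factor standing in for the `type` column
type <- factor(c("bc", "prof", "wc", "bc", "prof", "wc"))

# Design matrix lm() would use: an intercept plus k - 1 = 2 dummy columns;
# the first level, "bc", is the reference and gets no column of its own
X <- model.matrix(~ type)
colnames(X)

# The baseline can be changed with relevel(); the number of columns stays the same
X_wc <- model.matrix(~ relevel(type, ref = "wc"))
ncol(X_wc)
```

Because the reference level is absorbed into the intercept, the columns stay linearly independent and no coefficient comes out as NA.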

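For comparison, the original model with the hand-made dummies gives the same estimates, with the redundant blue_collars terms reported as NA:

summary(modello_interazioni)
Coefficients: (4 not defined because of singularities)
                           Estimate Std. Error t value Pr(>|t|)    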
(Intercept)              -172.83613   26.17288  -6.604 3.17e-09 ***
women                       0.14059    0.04758   2.955 0.004033 ** 
professional              147.25606   38.83048   3.792 0.000277 ***
education                   2.42215    0.88082   2.750 0.007266 ** 
income_log                 21.78584    3.15780   6.899 8.38e-10 ***
white_collars             -24.50672   68.08447  -0.360 0.719770    
blue_collars                     NA         NA      NA       NA    
women:professional         -0.16678    0.06888  -2.421 0.017561 *  
professional:education      0.68858    1.23286   0.559 0.577937    
professional:income_log   -16.29484    4.55783  -3.575 0.000577 ***
women:white_collars         0.05693    0.11155   0.510 0.611098    
education:white_collars     0.83715    2.17074   0.386 0.700706    
income_log:white_collars    1.06471    8.95592   0.119 0.905645    
women:blue_collars               NA         NA      NA       NA    
education:blue_collars           NA         NA      NA       NA    
income_log:blue_collars          NA         NA      NA       NA  
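Since blue collar is the baseline in both tables, each interaction coefficient is a difference from the blue-collar slope, so a group-specific slope is the main effect plus the matching interaction. As a small worked example, using the education estimates printed above (the variable names are made up for illustration):

```r
# Estimates copied from the summary output above
b_education      <- 2.42215  # education slope for the baseline group (bc)
b_education_prof <- 0.68858  # professional:education interaction
b_education_wc   <- 0.83715  # education:white_collars interaction

# Group-specific education slopes
slope_bc   <- b_education
slope_prof <- b_education + b_education_prof
slope_wc   <- b_education + b_education_wc
c(bc = slope_bc, prof = slope_prof, wc = slope_wc)
```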

Posted on 2023-07-10 16:03:28
Link to this post: https://go.coder-hub.com/76651804.html