lm with dummies' interactions
Question
I have been using the Prestige dataset from the mdhglm package. I wanted to understand how my model would change if I included interactions with the dummies of the predictor 'type'. The model runs, but the output reports NA for the 'blue collars' dummy itself and, as a result, for the interactions between blue collars and the other predictors. My data contain no NAs and the dummy is coded correctly, so I don't understand why. Can you please help me?
Prestige3$professional <- ifelse(Prestige3$type == "prof", 1, 0)
Prestige3$white_collars <- ifelse(Prestige3$type == "wc", 1, 0)
Prestige3$blue_collars <- ifelse(Prestige3$type == "bc", 1, 0)
modello_interazioni <- lm(
  prestige ~ women * professional + education * professional + income_log * professional +
    women * white_collars + education * white_collars + income_log * white_collars +
    women * blue_collars + education * blue_collars + income_log * blue_collars,
  data = Prestige3
)
summary(modello_interazioni)
I tried creating the dummies again because I thought they might be the problem, but they are coded correctly. I also rechecked for NAs and found none.
Answer 1
Score: 1
In the output, you may notice the line "Coefficients: (4 not defined because of singularities)" just above the table of coefficients.
This can happen for a number of reasons. Usually it is caused by collinearity: when one predictor is an exact linear combination of the others, lm() cannot estimate its coefficient and reports NA. In this case you don't need the dummy variables at all, because you can put type into the formula as a categorical variable, for example by wrapping it in C(). If type is already a factor, plain type should work as well.
model <- lm(prestige ~ women * C(type) +
              education * C(type) +
              income * C(type),
            data = Prestige)
summary(model)
Which then gives this table:
Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)           -5.822e+00  7.311e+00  -0.796  0.42803
women                  1.343e-01  4.656e-02   2.885  0.00494 **
C(type)prof            2.436e+01  1.351e+01   1.803  0.07496 .
C(type)wc             -2.178e+01  1.727e+01  -1.261  0.21081
education              1.625e+00  9.163e-01   1.773  0.07971 .
income                 4.692e-03  6.691e-04   7.013 5.00e-10 ***
women:C(type)prof     -1.601e-01  6.506e-02  -2.460  0.01588 *
women:C(type)wc        2.893e-02  1.117e-01   0.259  0.79619
C(type)prof:education  1.512e+00  1.235e+00   1.224  0.22423
C(type)wc:education    2.123e+00  2.190e+00   0.970  0.33491
C(type)prof:income    -4.144e-03  7.132e-04  -5.810 1.03e-07 ***
C(type)wc:income      -7.527e-04  1.814e-03  -0.415  0.67924
Hope that answers your question.
~R
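A note on reading a table like this: there is no separate row for the reference level of the factor. Its effect is carried by the intercept and the main-effect slopes, and the interaction rows are differences from it. Here is a minimal sketch of that on made-up data. The data frame d and its values are assumptions for illustration only, and plain type is used instead of C(type) on the assumption that type is a factor.

```r
# Synthetic data; variable names mirror the question but the values are made up.
set.seed(3)
n <- 90
d <- data.frame(
  type      = factor(rep(c("bc", "prof", "wc"), each = n / 3)),
  education = rnorm(n, mean = 10, sd = 2)
)
d$prestige <- 5 + 2 * d$education +
  ifelse(d$type == "prof", 1.5 * d$education, 0) + rnorm(n)

levels(d$type)  # "bc" sorts first, so it is the reference level

fit <- lm(prestige ~ education * type, data = d)
b <- coef(fit)

# Education slope within each group, rebuilt from the coefficients.
slope_bc   <- b[["education"]]
slope_prof <- b[["education"]] + b[["education:typeprof"]]

# Cross-check against a regression fitted to the professionals only.
fit_prof <- lm(prestige ~ education, data = subset(d, type == "prof"))
all.equal(slope_prof, coef(fit_prof)[["education"]])
```

So the blue-collar group is not missing from the model; it is the baseline against which the prof and wc rows are measured.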
Answer 2
Score: 0
There are two situations in which coefficients will be NA:
- You have more predictors than observations, so not all the coefficients can be estimated. In this situation even the standard errors, t-tests and p-values will all be NA. Half-normal plots are then used to assess the effects.
- There is complete aliasing.
In your case, you are experiencing the second situation: two columns are exactly the same, or one column is an exact linear combination of the others, with no randomness. Try the alias function to find the columns that are completely determined by the others:
alias(modello_interazioni)
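In that output, each row is an aliased term, and the columns with non-zero values show which other terms it is an exact linear combination of. For example, blue_collars = (Intercept) - professional - white_collars, because every observation belongs to exactly one of the three types. The same holds for the interactions, e.g. women:blue_collars = women - women:professional - women:white_collars. Because of this perfectly linear relationship, one of the terms must be NA.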
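The linear dependence can be reproduced on a small made-up data set. The sketch below is only illustrative: the data frame d is synthetic and its column names are assumptions chosen to resemble the question. It shows that the three dummies add up to the intercept column, so the design matrix loses one rank and lm() reports NA for the last dependent column.

```r
# Synthetic stand-in for the Prestige data (values are made up).
set.seed(1)
n <- 60
d <- data.frame(
  type      = factor(rep(c("bc", "prof", "wc"), each = n / 3)),
  education = rnorm(n, mean = 10, sd = 2)
)
d$prestige <- 20 + 3 * d$education + 10 * (d$type == "prof") + rnorm(n)

# One 0/1 dummy per level, as in the question.
d$professional  <- as.numeric(d$type == "prof")
d$white_collars <- as.numeric(d$type == "wc")
d$blue_collars  <- as.numeric(d$type == "bc")

# The three dummies add up to the intercept column (all ones).
stopifnot(all(d$professional + d$white_collars + d$blue_collars == 1))

fit <- lm(prestige ~ education + professional + white_collars + blue_collars,
          data = d)

X <- model.matrix(fit)
ncol(X)      # number of columns in the design matrix
qr(X)$rank   # one less than ncol(X): one coefficient cannot be estimated

coef(fit)    # blue_collars is reported as NA
alias(fit)   # shows how blue_collars depends on the other columns
```

Dropping any one of the three dummies (or the intercept) removes the dependence, which is what R does automatically when type enters the formula as a factor.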
Note that you could instead fit the model as:
summary(lm(prestige~(women + education + income_log) * type, Prestige3))
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)         -172.83613   26.17288  -6.604 3.17e-09 ***
women                  0.14059    0.04758   2.955 0.004033 **
education              2.42215    0.88082   2.750 0.007266 **
income_log            21.78584    3.15780   6.899 8.38e-10 ***
typeprof             147.25606   38.83048   3.792 0.000277 ***
typewc               -24.50672   68.08447  -0.360 0.719770
women:typeprof        -0.16678    0.06888  -2.421 0.017561 *
women:typewc           0.05693    0.11155   0.510 0.611098
education:typeprof     0.68858    1.23286   0.559 0.577937
education:typewc       0.83715    2.17074   0.386 0.700706
income_log:typeprof  -16.29484    4.55783  -3.575 0.000577 ***
income_log:typewc      1.06471    8.95592   0.119 0.905645
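This gives the result you want. There is no need to create the dummy variables manually unless you are implementing linear regression from scratch. For comparison, your original model produces the same estimates, with the aliased terms reported as NA: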
summary(modello_interazioni)
Coefficients: (4 not defined because of singularities)
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)              -172.83613   26.17288  -6.604 3.17e-09 ***
women                       0.14059    0.04758   2.955 0.004033 **
professional              147.25606   38.83048   3.792 0.000277 ***
education                   2.42215    0.88082   2.750 0.007266 **
income_log                 21.78584    3.15780   6.899 8.38e-10 ***
white_collars             -24.50672   68.08447  -0.360 0.719770
blue_collars                     NA         NA      NA       NA
women:professional         -0.16678    0.06888  -2.421 0.017561 *
professional:education      0.68858    1.23286   0.559 0.577937
professional:income_log   -16.29484    4.55783  -3.575 0.000577 ***
women:white_collars         0.05693    0.11155   0.510 0.611098
education:white_collars     0.83715    2.17074   0.386 0.700706
income_log:white_collars    1.06471    8.95592   0.119 0.905645
women:blue_collars               NA         NA      NA       NA
education:blue_collars           NA         NA      NA       NA
income_log:blue_collars          NA         NA      NA       NA
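That the factor formula and the hand-made dummies (with one level left out) describe the same model can be checked on synthetic data. The following is a rough sketch under assumptions: the data frame d and its values are invented, and only two predictors are used to keep it short.

```r
# Synthetic data; column names imitate the question, values are made up.
set.seed(2)
n <- 90
d <- data.frame(
  type      = factor(rep(c("bc", "prof", "wc"), each = n / 3)),
  education = rnorm(n, mean = 10, sd = 2),
  women     = runif(n, min = 0, max = 100)
)
d$prestige <- 10 + 2 * d$education + 0.1 * d$women +
  15 * (d$type == "prof") + rnorm(n)

# Manual dummies for two of the three levels; "bc" is left out as baseline.
d$professional  <- as.numeric(d$type == "prof")
d$white_collars <- as.numeric(d$type == "wc")

# Let R expand the factor itself ...
fit_factor <- lm(prestige ~ (women + education) * type, data = d)
# ... versus interacting each predictor with the two manual dummies.
fit_dummy  <- lm(prestige ~ (women + education) * (professional + white_collars),
                 data = d)

# Same fitted values, and the same set of estimates under different names.
all.equal(fitted(fit_factor), fitted(fit_dummy))
all.equal(sort(unname(coef(fit_factor))), sort(unname(coef(fit_dummy))))
```

The coefficients are compared after sorting because the two formulas may label and order the terms differently.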