Group vs individual data: how to correct standard errors?
Question
I have a set of 5 variables: Y is the response, and X1, X2, X3, and X4 are the predictors. There are 3,000 observations, and I fit a regression model in R:
fit <- lm(Y ~ X1 + X2 + X3 + X4, data = mydata)
The problem is that my data relate to 3,000 small units (parishes) nested in 40 larger units (counties). Y, X1, and X2 are measured at the parish level, whilst X3 and X4 are only available at the county level. I am concerned that lm will underestimate the standard errors of the coefficients for the last two variables. Is there a way to correct this?
Answer 1
Score: 3
Mixed models! Something like
library(lmerTest) ## so you can get p-values on fixed effects
fit <- lmer(Y ~ X1 + X2 + X3 + X4 + (1 + X1 + X2 | county), data = mydata)
The random-effect term allows the intercept, as well as the parish-level effects of X1 and X2, to vary across counties.
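Written out, the lmer call above corresponds roughly to the following model. This is a sketch: the indices (i for parishes, j for counties) and the symbols β, u, Σ, and σ² are notation introduced here for illustration, not taken from the original answer.

```latex
\begin{aligned}
Y_{ij} &= (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\,X1_{ij} + (\beta_2 + u_{2j})\,X2_{ij}
          + \beta_3\,X3_{j} + \beta_4\,X4_{j} + \varepsilon_{ij}, \\
(u_{0j},\, u_{1j},\, u_{2j})^{\top} &\sim \mathcal{N}(\mathbf{0},\, \Sigma),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0,\, \sigma^{2}).
\end{aligned}
```

X3 and X4 carry only a county index j because they are constant within a county, which is why they have no county-specific slope term.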
The logic here is that the fixed effects X1 + X2 + X3 + X4 specify that the model should allow these variables to affect the response at the population level (i.e., overall), while the random effects allow the intercept and the effects of X1 and X2 to vary across counties. The general rules are that (1) we can include in a random-effects term any variable that varies within the levels of the grouping variable (i.e., since X1 and X2 vary at the parish level, we can estimate the variance of their effects across counties; since X3 and X4 vary only between and not within counties, we cannot estimate the variation of their effects across counties), and (2) because the random effects are zero-centered, we should usually include a fixed effect corresponding to each random-effects term.
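Rule (1) can be checked directly on the data before writing the random-effects term. The following is a minimal base-R sketch; the toy data frame and the helper varies_within are hypothetical and exist only for illustration (they are not part of lme4 or of the original answer).

```r
# Hypothetical toy data: 3 counties with 4 parishes each.
# Column names mirror the question; the values are made up.
set.seed(1)
mydata <- data.frame(
  county = rep(c("A", "B", "C"), each = 4),
  X1 = rnorm(12),                         # parish-level: differs within a county
  X3 = rep(c(0.2, 1.5, -0.7), each = 4)   # county-level: constant within a county
)

# Hypothetical helper: TRUE if x takes more than one distinct value
# inside at least one level of the grouping variable.
varies_within <- function(x, group) {
  any(tapply(x, group, function(v) length(unique(v)) > 1))
}

varies_within(mydata$X1, mydata$county)  # TRUE: X1 may get a random slope by county
varies_within(mydata$X3, mydata$county)  # FALSE: X3 can only enter as a fixed effect
```

A variable for which this returns FALSE has no within-county variation from which a county-specific slope could be estimated.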
For what it's worth, you can use the equatiomatic package to extract LaTeX-formatted model specifications; see the vignette.
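To get a feel for the concern raised in the question, here is a small base-R simulation sketch. Everything in it is an assumption made for illustration: the data are simulated, the county effect has standard deviation 1, and only a single county-level predictor is used. It compares the standard error from a naive lm fit on all 3,000 parishes with the one from a regression on the 40 county means, which in this balanced setup respects the fact that X3 carries only 40 independent values.

```r
set.seed(42)
n_county <- 40
n_parish <- 75                                  # 40 * 75 = 3,000 parishes
county   <- rep(seq_len(n_county), each = n_parish)

X3_county <- rnorm(n_county)                    # county-level predictor (simulated)
u_county  <- rnorm(n_county, sd = 1)            # shared county effect (assumed sd)
X3 <- X3_county[county]                         # repeat each county value per parish
Y  <- 0.5 * X3 + u_county[county] + rnorm(n_county * n_parish)

# Naive fit: treats all 3,000 parishes as independent observations.
se_naive <- summary(lm(Y ~ X3))$coefficients["X3", "Std. Error"]

# County-level fit: one mean of Y per county, so only 40 observations.
Y_bar     <- as.vector(tapply(Y, county, mean))
se_county <- summary(lm(Y_bar ~ X3_county))$coefficients["X3_county", "Std. Error"]

c(naive = se_naive, county_level = se_county)
```

In this setup the naive standard error comes out noticeably smaller than the county-level one, which is the underestimation the mixed model is meant to address.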