Use of svyglm and svydesign with R for multistage stratified cluster design
Question
I have a complicated data set that was collected using a multistage stratified cluster design. I originally analysed it using glm, but now realise that I have to use svyglm. I'm not quite sure how best to model the data with svyglm. I was wondering if anyone could help shed some light.
I am attempting to see the effect that a variety of covariates taken at time 1 have on a binary outcome taken at time 2.
The sampling strategy was as follows: state -> urban/rural -> district -> subdistrict -> village. Within each village, individuals were randomly selected, with each of these having an id (uniqid).
I have a variable in the df for each of these stages of the sampling strategy. I also have the following variables: outcome, age, sex, income, marital_status, urban_or_rural_area, uniqid, weights. The formula that I want for my regression equation is outcome ~ age + sex + income + marital_status + urban_or_rural_area. The sampling weights are stored in the weights variable. I had set the family to binomial(link = logit).
If anyone has any idea how such an approach could be coded in R with svyglm I would be most appreciative. I'm quite confused as to what should be supplied as id, fpc and nest. Do I have to specify all levels of the stratified design or just some?
Any direction, or resources which explain this well would be massively appreciated.
Answer 1
Score: 1
You don't really give enough information about the design: which of the geographical units are strata and which are clusters. For example, my guess is that you sample both urban and rural in all states, and you don't sample all villages, but I don't know whether you sample all districts or subdistricts. I also don't know whether your overall sampling fraction is large or small (and so whether the with-replacement approximation is OK).
Let's pretend you sample just some districts, so districts are your Primary Sampling Units, and that the overall sampling fraction of people is small. The design command is
your_design <- svydesign(id = ~district, weights = ~weights,
                         strata = ~interaction(state, urban_rural, drop = TRUE),
                         data = your_data_frame)
That is, the strata are combinations of state and urban/rural and any combinations that aren't in your data set don't exist in the population (maybe some states are all-rural or all-urban). Within each stratum you have districts, and only some of these appear in the sample. In your geographical hierarchy, districts are then the first level that is sampled rather than exhaustively enumerated.
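To make the pretend design above concrete, here is a minimal sketch on made-up data. The data frame toy, its column values and the numbers of states, districts and people are all invented for illustration; only svydesign and its arguments come from the survey package.

```r
library(survey)

set.seed(1)
# Hypothetical toy data: 3 states x 2 area types (6 strata),
# 4 sampled districts per stratum, 10 people per district.
toy <- expand.grid(state = c("A", "B", "C"),
                   urban_rural = c("urban", "rural"),
                   district_no = 1:4,
                   person = 1:10,
                   stringsAsFactors = FALSE)
# District labels are made unique across strata, so nest= is not needed.
toy$district <- paste(toy$state, toy$urban_rural, toy$district_no, sep = "_")
# Made-up sampling weights.
toy$weights <- runif(nrow(toy), 50, 150)

toy_design <- svydesign(id = ~district, weights = ~weights,
                        strata = ~interaction(state, urban_rural, drop = TRUE),
                        data = toy)
print(toy_design)

# Count the strata and PSUs (districts) that the design object recorded.
n_strata <- length(unique(toy_design$strata[, 1]))
n_psu    <- length(unique(toy_design$cluster[, 1]))
c(strata = n_strata, psus = n_psu)

# Design degrees of freedom (PSUs minus strata), used for tests and intervals.
degf(toy_design)
```

Checking the strata and PSU counts against what you know about the survey is a cheap way to catch a mis-specified id or strata formula before fitting any model.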
You don't need fpc unless you want to specify the full multistage design without replacement.
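For comparison, this is roughly what a full multistage specification with fpc looks like, using the apiclus2 example data that ships with the survey package (a two-stage sample of school districts, then schools). It is a sketch of the two options, not a recommendation for your data, where the population counts at each stage would have to come from your own sampling frame.

```r
library(survey)
data(api)  # example data sets shipped with the survey package

# Full two-stage design without replacement: districts (dnum), then schools (snum).
# fpc1 and fpc2 hold the population counts at each stage.
two_stage <- svydesign(id = ~dnum + snum, fpc = ~fpc1 + fpc2, data = apiclus2)

# Same sample and same weights, but using the with-replacement approximation
# (no fpc); variability is then assessed at the first stage only.
with_repl <- svydesign(id = ~dnum + snum, weights = weights(two_stage),
                       data = apiclus2)

est_fpc <- svymean(~api00, two_stage)
est_wr  <- svymean(~api00, with_repl)

# Point estimates agree; only the standard errors differ.
print(est_fpc)
print(est_wr)
```

When the sampling fraction is small the two standard errors are usually close, which is why the with-replacement form is often good enough.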
The nest option is not about how the survey was done but is about how variables are coded. The US National Center for Health Statistics (bless their hearts) set up a lot of designs that have many strata and two primary sampling units per stratum. They call these primary sampling units 1 and 2; that is, they reuse the names 1 and 2 in every stratum. The svydesign function is set up to expect different sampling unit names in different strata, and to verify that each sampling unit name appears in just one stratum, as a check against data errors. This check has to be disabled for NCHS surveys and perhaps some others that also reuse sampling unit names. You can always leave out the nest option at first; svydesign will tell you if it might be needed.
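A small sketch of the nest behaviour on made-up data, where PSU labels 1 and 2 are reused in every stratum. The data frame toy and its columns are invented for illustration.

```r
library(survey)

# Hypothetical toy data: 3 strata, each with PSUs labelled 1 and 2,
# so the same labels are reused across strata (NCHS-style coding).
toy <- expand.grid(person = 1:5, psu = 1:2, stratum = c("s1", "s2", "s3"))
toy$w <- 100  # made-up constant weight

# Without nest = TRUE, svydesign finds PSU "1" in several strata and stops.
msg <- tryCatch({
  svydesign(id = ~psu, strata = ~stratum, weights = ~w, data = toy)
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# With nest = TRUE, PSU labels are treated as nested within strata,
# so the design sees 3 x 2 = 6 distinct PSUs.
nested <- svydesign(id = ~psu, strata = ~stratum, weights = ~w,
                    data = toy, nest = TRUE)
n_psu <- length(unique(nested$cluster[, 1]))
n_psu
```

If your district codes are already unique across state and urban/rural combinations, the first call succeeds and nest can be left out.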
Finally, the model:
svyglm(outcome ~ age + sex + income + marital_status + urban_or_rural_area,
       design = your_design, family = quasibinomial)
Using binomial or quasibinomial will give identical answers, but using binomial will give you a harmless warning about non-integer weights. If you use quasibinomial, the harmless warning is suppressed.
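If you want to check the binomial versus quasibinomial claim yourself, here is a quick sketch using the apistrat example data from the survey package. The model is just the one from the package's own examples, not your model.

```r
library(survey)
data(api)  # example data sets shipped with the survey package

# Stratified sample of schools, with weights and population counts.
strat_design <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                          fpc = ~fpc, data = apistrat)

# family = binomial warns about non-integer successes; silence it here.
fit_binom <- suppressWarnings(
  svyglm(sch.wide ~ ell + meals + mobility,
         design = strat_design, family = binomial)
)
fit_quasi <- svyglm(sch.wide ~ ell + meals + mobility,
                    design = strat_design, family = quasibinomial)

# Compare coefficients and design-based standard errors side by side.
cbind(binomial = coef(fit_binom), quasibinomial = coef(fit_quasi))
cbind(binomial = sqrt(diag(vcov(fit_binom))),
      quasibinomial = sqrt(diag(vcov(fit_quasi))))

same_coef <- isTRUE(all.equal(coef(fit_binom), coef(fit_quasi)))
same_coef
```

The standard errors come from the survey design rather than from the family's dispersion estimate, which is why the choice between the two families does not change them.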