Collapsing multiple observations based on specific parameters in R
Question
I am quite new to R. I have a dataset with 8081 observations for 113 variables. The data was collected in 4 waves (panels), with some individuals being interviewed multiple times. They were sometimes asked the same questions, but some questions were only asked during one wave. Most answers were on a scale (e.g. how much do you agree) or a binary. Each individual has a unique numeric ID so I know which rows to collapse. My dependent variable was only surveyed in wave 4.
The data looks something like this:
df <- data.frame(ID = c(1, 1, 2, 2, 2, 2, 3, 4, 4), PANEL = c(1, 4, 1, 2, 3, 4, 2, 3, 4),
AGE = c(68, 68, 52, 52, 52, 52, 43, 33, 33), Q4 = c(2, 2, 1, 1, 1, 1, 2, 2, 1),
Q4_1 = c(2, 2, 1, 1, 1, 1, 2, 2, 1), Q4_1 = c(2, NA, NA, 3, NA, 3, 2, 3, NA),
Q5 = c(10, 10, 8, 9, 8, 7, 6, 6, 5))
df
ID PANEL AGE Q4 Q4_1 Q4_1.1 Q5
1 1 1 68 2 2 2 10
2 1 4 68 2 2 NA 10
3 2 1 52 1 1 NA 8
4 2 2 52 1 1 3 9
5 2 3 52 1 1 NA 8
6 2 4 52 1 1 3 7
7 3 2 43 2 2 2 6
8 4 3 33 2 2 3 6
9 4 4 33 1 1 NA 5
etc...
Sometimes each individual's answers are the same across waves, but not always. I do not need to know how the answers vary in time and the waves were relatively close in time. I just need a profile of each individual surveyed.
Ideally, for each individual I'd want to keep the answers given in wave 4 (if they took part in it), but substituting the NA answers with what they answered in previous waves. I was wondering if there is any way to do this without going through every individual one by one, given the amount of data. I'll also have to remove data for individuals who did not take part in wave 4 at all.
If successful, the chunk of data above would end up looking something like this (+ I'll remove the panel column later):
ID PANEL AGE Q4 Q4_1 Q5
1 1 4 68 2 2 10
2 2 4 52 1 3 7
3 4 4 33 1 3 5
etc...
I've been looking into dplyr's summarise() function but it doesn't seem like I can be that specific with what I need to merge and not merge. It wouldn't be a problem if some of the answers were merged by getting an average of the individual's responses across waves, but that would not work for the binary answers if the individual changed their mind in between waves.
Answer 1
Score: 1
You may be looking for tidyr::fill(). You can do the following on your provided example dataset:
library(dplyr); library(tidyr)
df %>% group_by(ID) %>% fill(starts_with("Q")) %>% filter(PANEL == 4)
Output:
ID PANEL AGE Q4 Q4_1 Q4_1.1 Q5
1 1 4 68 2 2 2 10
2 2 4 52 1 1 3 7
3 4 4 33 1 1 3 5
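The one-liner above relies on the rows already being ordered by wave within each ID, because fill() carries the last non-NA value downward in row order. A slightly more defensive sketch, assuming the real data may not be pre-sorted, could sort first and drop the panel column at the end. The name `profile` is just an illustrative choice, and the second Q4_1 column is written out as Q4_1.1 here, which is the name data.frame() gives it anyway:

```r
library(dplyr)
library(tidyr)

# Example data from the question. The duplicated Q4_1 column is named
# Q4_1.1 explicitly, matching what data.frame() produces by default.
df <- data.frame(
  ID     = c(1, 1, 2, 2, 2, 2, 3, 4, 4),
  PANEL  = c(1, 4, 1, 2, 3, 4, 2, 3, 4),
  AGE    = c(68, 68, 52, 52, 52, 52, 43, 33, 33),
  Q4     = c(2, 2, 1, 1, 1, 1, 2, 2, 1),
  Q4_1   = c(2, 2, 1, 1, 1, 1, 2, 2, 1),
  Q4_1.1 = c(2, NA, NA, 3, NA, 3, 2, 3, NA),
  Q5     = c(10, 10, 8, 9, 8, 7, 6, 6, 5)
)

profile <- df %>%
  arrange(ID, PANEL) %>%                          # earlier waves come first
  group_by(ID) %>%                                # fill only within an individual
  fill(starts_with("Q"), .direction = "down") %>% # NAs take the most recent earlier answer
  filter(PANEL == 4) %>%                          # keep wave 4; drops IDs who never took part in it
  ungroup() %>%
  select(-PANEL)                                  # panel column no longer needed

profile
```

Wave 4 answers are kept wherever they exist, since fill() only touches NA cells. With the real data, starts_with("Q") would need to match the actual question columns; that naming pattern is an assumption carried over from the example.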