2023-03-07 22:15:38


Collapsing multiple observations based on specific parameters in R

Question


I am quite new to R. I have a dataset with 8081 observations for 113 variables. The data was collected in 4 waves (panels), with some individuals being interviewed multiple times. They were sometimes asked the same questions, but some questions were only asked during one wave. Most answers were on a scale (e.g. how much do you agree) or a binary. Each individual has a unique numeric ID so I know which rows to collapse. My dependent variable was only surveyed in wave 4.

The data looks something like this:

df <- data.frame(ID = c(1, 1, 2, 2, 2, 2, 3, 4, 4), PANEL = c(1, 4, 1, 2, 3, 4, 2, 3, 4),
AGE = c(68, 68, 52, 52, 52, 52, 43, 33, 33), Q4 = c(2, 2, 1, 1, 1, 1, 2, 2, 1),
Q4_1 = c(2, 2, 1, 1, 1, 1, 2, 2, 1), Q4_1 = c(2, NA, NA, 3, NA, 3, 2, 3, NA),
Q5 = c(10, 10, 8, 9, 8, 7, 6, 6, 5))

df

  ID PANEL AGE Q4 Q4_1 Q4_1.1 Q5
1  1     1  68  2    2      2 10
2  1     4  68  2    2     NA 10
3  2     1  52  1    1     NA  8
4  2     2  52  1    1      3  9
5  2     3  52  1    1     NA  8
6  2     4  52  1    1      3  7
7  3     2  43  2    2      2  6
8  4     3  33  2    2      3  6
9  4     4  33  1    1     NA  5

etc...

Sometimes each individual's answers are the same across waves, but not always. I do not need to know how the answers vary in time and the waves were relatively close in time. I just need a profile of each individual surveyed.

Ideally, for each individual I'd want to keep the answers given in wave 4 (if they took part in it), but substituting the NA answers with what they answered in previous waves. I was wondering if there is any way to do this without going through every individual one by one, given the amount of data. I'll also have to remove data for individuals who did not take part in wave 4 at all.

If successful, the chunk of data above would end up looking something like this (+ I'll remove the panel column later):

   ID  PANEL  AGE  Q4  Q4_1 Q5
1   1     4    68   2    2  10 
2   2     4    52   1    3   7 
3   4     4    33   1    3   5

etc...

I've been looking into dplyr's summarise() function but it doesn't seem like I can be that specific with what I need to merge and not merge. It wouldn't be a problem if some of the answers were merged by getting an average of the individual's responses across waves, but that would not work for the binary answers if the individual changed their mind in between waves.

Answer 1

Score: 1


You may be looking for tidyr::fill(). You can do the following on your provided example dataset:

library(dplyr); library(tidyr)
df %>% group_by(ID) %>% fill(., starts_with("Q")) %>% filter(PANEL == 4)

Output:

  ID PANEL AGE Q4 Q4_1 Q4_1.1 Q5
1  1     4  68  2    2      2 10
2  2     4  52  1    1      3  7
3  4     4  33  1    1      3  5
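A slightly fuller version of the same idea, as a minimal sketch rather than a definitive solution. fill() only carries values downward in row order, so this sketch sorts by PANEL within each ID first; that sort, the explicit Q4_1.1 column name, the `profile` variable, and the final select() are additions not in the answer above. It also assumes every question column starts with "Q"; with the full 113-variable dataset a different selector may be needed.

```r
library(dplyr)
library(tidyr)

# Example data from the question. The second Q4_1 column is named
# Q4_1.1 explicitly here (data.frame() renames the duplicate that way anyway).
df <- data.frame(
  ID     = c(1, 1, 2, 2, 2, 2, 3, 4, 4),
  PANEL  = c(1, 4, 1, 2, 3, 4, 2, 3, 4),
  AGE    = c(68, 68, 52, 52, 52, 52, 43, 33, 33),
  Q4     = c(2, 2, 1, 1, 1, 1, 2, 2, 1),
  Q4_1   = c(2, 2, 1, 1, 1, 1, 2, 2, 1),
  Q4_1.1 = c(2, NA, NA, 3, NA, 3, 2, 3, NA),
  Q5     = c(10, 10, 8, 9, 8, 7, 6, 6, 5)
)

profile <- df %>%
  arrange(ID, PANEL) %>%        # earlier waves first within each individual
  group_by(ID) %>%
  fill(starts_with("Q"), .direction = "down") %>%  # carry the last non-NA answer forward
  ungroup() %>%
  filter(PANEL == 4) %>%        # keep wave 4 rows; individuals without wave 4 drop out
  select(-PANEL)                # the panel column is no longer needed

print(profile)
```

Because the wave 4 row is the last row for each individual after sorting, its own answers are kept and only its NA values are replaced by the most recent earlier answer. Individual 3, who never took part in wave 4, is removed by the filter.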
