February 8, 2023, 12:07:27

Reshaping large long dataset with numerous out of order responses to wide in R

Question

I have a large dataset that contains participant responses to a number of questionnaires. I'm trying to convert it to wide, but the problem is that not all participants answered all questionnaires, so when converting to wide their responses do not line up. There are also no actual variables for each questionnaire, just an overall questionnaire variable, which denotes which questionnaire and item the participant was responding to, and a separate response variable.

The long dataset looks something like this:

# long dataset
subject <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
questionnaire <- c("Q1_1", "Q1_2", "Q1_3", "Q2_1", "Q2_2", "Q2_3",
                   "Q2_1", "Q2_2", "Q2_3", "Q3_1", "Q3_2", "Q3_3")
response <- c(1, 2, 1, 4, 3, 1, 2, 1, 5, 3, 1, 2)
uniqid <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
long <- as.data.frame(cbind(subject, questionnaire, response, uniqid))
long
   subject questionnaire response uniqid
1        1          Q1_1        1      1
2        1          Q1_2        2      2
3        1          Q1_3        1      3
4        1          Q2_1        4      4
5        1          Q2_2        3      5
6        1          Q2_3        1      6
7        2          Q2_1        2      7
8        2          Q2_2        1      8
9        2          Q2_3        5      9
10       2          Q3_1        3     10
11       2          Q3_2        1     11
12       2          Q3_3        2     12

Q1 represents the first questionnaire, Q2 the second, and so on; the _1, _2, etc. suffixes are the items within each questionnaire. What I want the data to look like eventually is:

  subject Q1_1 Q1_2 Q1_3 Q2_1 Q2_2 Q2_3
1       1    1    2    1    4    3    1
2       2   NA   NA   NA    2    1    5

And so on for each questionnaire and item.

The data currently looks more like this:

    subject    questionnaire.1    response.1    questionnaire.2    response.2 
1         1               Q1_1             1               Q1_2             2
2         2               Q2_1             2               Q2_2             1

etc.

I'm running into 2 problems at the moment.

1. Not all participants answered all questionnaires, and they started questionnaires at different times from one another. As a result, responses to different questionnaires end up in the same columns when converting to wide. For example, as above, participant 1 is answering questionnaire 1 item 1 in the questionnaire.1 column, while participant 2 is answering questionnaire 2 item 1 in that same column. I don't know how to make each column represent the same questionnaire item for all participants.
2. I don't want separate questionnaire and response columns, but rather one column per questionnaire item (e.g. titled Q1_1), with the responses from all participants below it. That would be easy enough to achieve on its own, but because each column holds responses to different questionnaires, I cannot achieve this without first reordering the columns.

The only solution I have thought of is to insert NA rows into the long dataset for the questionnaires that each participant did not answer. In the example above, that would mean inserting 3 NA rows for participant 2, who did not answer the Q1 items. The dataset is very large, however, and I'm sure there are easier ways to achieve what I'm trying to do.

Answer 1

Score: 0

library(dplyr)
library(tidyr)
long |>
  select(-uniqid) |>
  pivot_wider(names_from = questionnaire, values_from = response)
# A tibble: 2 × 10
  subject Q1_1  Q1_2  Q1_3  Q2_1  Q2_2  Q2_3  Q3_1  Q3_2  Q3_3 
  <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1       1     2     1     4     3     1     NA    NA    NA   
2 2       NA    NA    NA    2     1     5     3     1     2
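
A note on the `<chr>` columns in that output: they come from the question's `as.data.frame(cbind(...))`, because `cbind()` on mixed vectors builds a character matrix first. The variant below is only a sketch, assuming the tidyr package (version 1.0.0 or later) is installed and that the real data can be built or read in as a regular data frame. It keeps `response` numeric and uses `id_cols` so that `uniqid` is dropped without a separate `select()` step. The name `wide` is just an illustrative choice.

```r
library(tidyr)

# Build the example with data.frame() instead of as.data.frame(cbind(...)),
# so numeric columns stay numeric rather than being coerced to character.
long <- data.frame(
  subject = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  questionnaire = c("Q1_1", "Q1_2", "Q1_3", "Q2_1", "Q2_2", "Q2_3",
                    "Q2_1", "Q2_2", "Q2_3", "Q3_1", "Q3_2", "Q3_3"),
  response = c(1, 2, 1, 4, 3, 1, 2, 1, 5, 3, 1, 2),
  uniqid = 1:12
)

# id_cols = subject gives one row per participant; uniqid is left out because
# it is neither an id column nor a names/values column.
# Items a participant never answered are filled with NA automatically.
wide <- pivot_wider(
  long,
  id_cols = subject,
  names_from = questionnaire,
  values_from = response
)

wide
```

Because the columns are matched by the item name in `questionnaire` rather than by position, there should be no need to insert NA rows into the long data beforehand.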
