Reshaping large long dataset with numerous out of order responses to wide in R
Question
I have a large dataset that contains participant responses to a number of questionnaires. I'm trying to convert it to wide format, but not all participants answered all questionnaires, so their responses do not line up after the conversion. There are also no separate variables for each questionnaire, just a single questionnaire variable, which denotes which questionnaire and item the participant was responding to, and a separate response variable.
The long dataset looks something like this:
# long dataset
subject <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
questionnaire <- c("Q1_1", "Q1_2", "Q1_3", "Q2_1", "Q2_2", "Q2_3",
                   "Q2_1", "Q2_2", "Q2_3", "Q3_1", "Q3_2", "Q3_3")
response <- c(1, 2, 1, 4, 3, 1, 2, 1, 5, 3, 1, 2)
uniqid <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
# note: cbind() on mixed vectors coerces every column to character
long <- as.data.frame(cbind(subject, questionnaire, response, uniqid))
long
   subject questionnaire response uniqid
1        1          Q1_1        1      1
2        1          Q1_2        2      2
3        1          Q1_3        1      3
4        1          Q2_1        4      4
5        1          Q2_2        3      5
6        1          Q2_3        1      6
7        2          Q2_1        2      7
8        2          Q2_2        1      8
9        2          Q2_3        5      9
10       2          Q3_1        3     10
11       2          Q3_2        1     11
12       2          Q3_3        2     12
Q1 represents the first questionnaire, Q2 the second, and so on; the _1, _2 suffixes are the items within each questionnaire. What I want the data to look like eventually is:
  subject Q1_1 Q1_2 Q1_3 Q2_1 Q2_2 Q2_3
1       1    1    2    1    4    3    1
2       2   NA   NA   NA    2    1    5
And so on for each questionnaire and item.
The data currently looks more like this:
  subject questionnaire.1 response.1 questionnaire.2 response.2
1       1            Q1_1          1            Q1_2          2
2       2            Q2_1          2            Q2_2          1
etc.
I'm running into two problems at the moment.
- Not all participants answered all questionnaires, and participants start questionnaires at different times from one another. That leads to responses to different questionnaires ending up in the same columns when converting to wide. For example, as shown above, participant 1 is answering questionnaire 1 item 1 in the questionnaire.1 column, while participant 2 is answering questionnaire 2 item 1. I don't know how to make each column represent the same questionnaire item for all participants.
- I don't want separate questionnaire and response columns, just one column per questionnaire item (titled Q1_1 etc.) with the responses from all participants below. That would be easy enough to achieve on its own, but because each column holds responses to different questionnaires, I cannot do it without first reordering the columns.
The only solution I have thought of is to insert NA rows into the long dataset for questionnaires that each participant did not answer. For example, in the data above, that would mean inserting 3 NA rows for participant 2 for the Q1 items they did not answer. However, the dataset is very large, and I'm sure there are easier ways to achieve what I'm trying to do.
Answer 1
Score: 0
library(dplyr)
library(tidyr)
long |>
  select(-uniqid) |>
  pivot_wider(names_from = questionnaire, values_from = response)
# A tibble: 2 × 10
  subject Q1_1  Q1_2  Q1_3  Q2_1  Q2_2  Q2_3  Q3_1  Q3_2  Q3_3
  <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1       1     2     1     4     3     1     NA    NA    NA
2 2       NA    NA    NA    2     1     5     3     1     2
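A note on the output above: the columns print as <chr> because the example builds `long` with cbind(), which turns every value into a character string. The following is a minimal sketch, assuming the real data can be built or read with proper column types (the names `long_num` and `wide` are only illustrative), that keeps the responses numeric. pivot_wider() fills in NA for any subject/item combination that is absent, so inserting NA rows by hand should not be necessary.

```r
library(dplyr)
library(tidyr)

# Build the example with data.frame() instead of cbind(),
# so numeric columns stay numeric (assumption: the real data
# can be created or imported with correct types).
long_num <- data.frame(
  subject = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  questionnaire = c("Q1_1", "Q1_2", "Q1_3", "Q2_1", "Q2_2", "Q2_3",
                    "Q2_1", "Q2_2", "Q2_3", "Q3_1", "Q3_2", "Q3_3"),
  response = c(1, 2, 1, 4, 3, 1, 2, 1, 5, 3, 1, 2),
  uniqid = 1:12
)

# Drop the row-level id, then spread items into columns;
# missing subject/item combinations become NA automatically.
wide <- long_num |>
  select(-uniqid) |>
  pivot_wider(names_from = questionnaire, values_from = response)

wide
```

If the data frame already exists with character columns, converting `response` with as.numeric() before pivoting is one option.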