How to separate multiple answers in one column, for multiple columns, by creating extra columns
Question
1. Data
I have survey data:
dat <- structure(list(ID = c(4, 5), Start_time = structure(c(1676454186,
1676454173), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
End_time = structure(c(1676454352, 1676454642), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), `want_to_change Mult answ` = c("Yes (for the environment), because it provided a starting point to collectively do something about energy consumption.;",
"Yes (because of the gas crisis), because it provided a starting point to collectively do something. ;"
), actually_changed = c("Yes, I tried to use less energy in the office.",
"No, not at all."), `control Mult answ` = c("We / I can control the lights.;Closing/opening doors and windows.;",
"We / I can control the lights.;Closing/opening doors and windows.;"), `measures_taken Mult answ` = c("Yes, I checked for lights that were not turned off.; Yes, went home early",
"Yes, I checked for lights that were not turned off.;")), row.names = c(NA,
-2L), class = c("data.table",
"data.frame"))
that looks as follows:
2. Structure of the data
Some of the columns can have more than one answer. These columns have "Mult answ" in the column name. See for example row 1, column 6 (`dat[1,6]`).
> dat[1,6]
control Mult answ
1: We / I can control the lights.;Closing/opening doors and windows.;
3. Question
I would like to write a piece of code that:
- Changes all answers that only occur once to `Other` (this is because there are many custom answers).
- Creates a separate column for each answer option, with a generic suffix.
4. What I have tried
I thought I would first select the columns that have multiple answers:
# Get columns with more than one answer
temp <- select(dat,contains("Mult answ"))
cols_with_more_answers <- names(temp)
I then thought to split the columns up by the semicolon (before I count them and change the unique ones to `Other`). But I have multiple columns and never know how many answers there might be.
# Separate columns
tidyr::separate(data.frame(text = dat), text, into = c("A", "B", "C"), sep = ";", fill = "right", extra = "drop")
How should I continue here?
5. Desired output
dat <- structure(list(ID = c(4, 5),
Start_time = structure(c(1676454186, 1676454173), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
End_time = structure(c(1676454352, 1676454642), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
`want_to_change Mult answ` = c("Other", "Other"),
actually_changed = c("No, not at all.", "Yes, I tried to use less energy in the office."),
`control Mult answ A` = c("We / I can control the lights.", "We / I can control the lights."),
`control Mult answ B` = c("Closing/opening doors and windows", "Closing/opening doors and windows"),
`measures_taken Mult answ A` = c("Yes, I checked for lights that were not turned off.", "Yes, I checked for lights that were not turned off."),
`measures_taken Mult answ B` = c(NA, "Yes, went home early")),
row.names = c(NA, -2L),
class = c("data.table", "data.frame"))
Answer 1
Score: 1
You could do something like this.
(Converting the answers to letters, and keeping that stable in case you have more than 26 answers, was a bit tricky, but I found a way around it.)
I left a few comments in the code. In short:
- Pivot the multiple-answer questions into rows and separate the answers with `separate_rows`.
- At that point you can replace the answers that appear only once with `forcats::fct_lump_min`.
- Then you can create a new column that converts answers to letters. For that I had to create the function `values2letters`, which calls `expand_letters`. The first function simply recodes the answers into letters; the second creates the letters. If you have more than 26 answers, single letters would not be enough, so the function makes combinations of letters.
- In the end, you spread the answers over the combination of each question and its corresponding letter to get the expected result.
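The first two steps can also be seen in isolation on a toy vector. This is only a minimal base-R sketch of the idea, not the tidyverse code of this answer: the `answers` vector is made up for illustration, `trimws()` is an extra step the answer's pipeline does not perform, and the counting with `table()` mirrors the intent of `forcats::fct_lump_min` rather than its exact behavior.

```r
# Hypothetical answers to one multi-answer question; the real data is in `dat`.
answers <- c("Lights;Doors;", "Lights;Doors; Custom remark")

# Step 1: split every cell on ";" (assumption: trimming whitespace is wanted)
pieces <- lapply(strsplit(answers, ";", fixed = TRUE), trimws)
# Drop empty pieces that a stray ";" could leave behind
pieces <- lapply(pieces, function(p) p[nzchar(p)])

# Step 2: count each answer across all rows; answers seen fewer than
# 2 times are replaced by "Other"
counts <- table(unlist(pieces))
rare <- names(counts)[counts < 2]
lumped <- lapply(pieces, function(p) ifelse(p %in% rare, "Other", p))

# The number of answer columns needed is the longest row, so it does not
# have to be guessed in advance
n_cols <- max(lengths(lumped))

print(lumped)
print(n_cols)
```

Because the column count comes from the data (`n_cols`), nothing like `into = c("A", "B", "C")` has to be fixed up front.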
``` r
library(dplyr)
library(tidyr)
expand_letters <- function(l){
  # how many times letters must repeat?
  x <- ceiling(log(l, 26))
  # correct in case of zero
  x <- max(x, 1)
  # repeat the letters
  x <- rep(list(LETTERS), x)
  # get combinations
  x <- expand.grid(x)
  # collapse letters
  x <- do.call(paste0, rev(x))
  # return only the needed ones
  x[seq_len(l)]
}
values2letters <- function(x){
  x <- factor(x)
  levels <- levels(x)
  l <- length(levels)
  new_levels <- expand_letters(l)
  recode <- setNames(levels, new_levels)
  as.character(forcats::fct_recode(x, !!!recode))
}
dat %>%
  # pivot only the multi-answer columns
  pivot_longer(ends_with("Mult answ")) %>%
  # separate by ; into multiple rows
  separate_rows(value, sep = ";") %>%
  # remove empty rows (automatically created at the end because the lines end with ;)
  filter(value != "") %>%
  # change to "Other" if an answer appears fewer than 2 times (counted across all questions)
  mutate(value = as.character(forcats::fct_lump_min(value, 2))) %>%
  # recode to letters by question
  group_by(name) %>%
  mutate(valueLetters = values2letters(value)) %>%
  ungroup() %>%
  # distinct in case you have multiple "Other"
  distinct() %>%
  # spread values
  pivot_wider(names_from = c(name, valueLetters), values_from = value, names_sep = " ")
#> # A tibble: 2 x 9
#> ID Start_time End_time actual~1 want_~2 contr~3 contr~4
#> <dbl> <dttm> <dttm> <chr> <chr> <chr> <chr>
#> 1 4 2023-02-15 09:43:06 2023-02-15 09:45:52 Yes, I ~ Other We / I~ Closin~
#> 2 5 2023-02-15 09:42:53 2023-02-15 09:50:42 No, not~ Other We / I~ Closin~
#> # ... with 2 more variables: `measures_taken Mult answ B` <chr>,
#> # `measures_taken Mult answ A` <chr>, and abbreviated variable names
#> # 1: actually_changed, 2: `want_to_change Mult answ A`,
#> # 3: `control Mult answ B`, 4: `control Mult answ A`
```

<sup>Created on 2023-03-20 with reprex v2.0.2</sup>
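To see what the letter suffixes look like, the `expand_letters` helper from the answer can be run on its own. The block below repeats its definition (slightly condensed) so it is self-contained. The sizes 3 and 28 are arbitrary examples chosen for illustration.

```r
# Same logic as expand_letters in the answer above, repeated here so the
# snippet runs on its own (base R only).
expand_letters <- function(l) {
  # number of letter positions needed, at least one
  x <- max(ceiling(log(l, 26)), 1)
  # all combinations of that many letters
  x <- expand.grid(rep(list(LETTERS), x))
  # paste with the slowest-varying position first, then keep the first l labels
  x <- do.call(paste0, rev(x))
  x[seq_len(l)]
}

few  <- expand_letters(3)   # up to 26 labels: single letters
many <- expand_letters(28)  # more than 26 labels: two-letter combinations

print(few)
print(many[c(1, 26, 27, 28)])
```

One thing worth knowing: as soon as a question has more than 26 distinct answers, every suffix becomes a two-letter code (`AA`, `AB`, ...), not `A` to `Z` followed by `AA`. The labels stay unique either way, which is all the later `pivot_wider` step needs.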