2023-03-21 00:13:39

How to separate multiple answers in one column, for multiple columns, by creating extra columns

Question

1. Data

I have survey data:

dat <- structure(list(ID = c(4, 5), Start_time = structure(c(1676454186, 
1676454173), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    End_time = structure(c(1676454352, 1676454642), class = c("POSIXct", 
    "POSIXt"), tzone = "UTC"), `want_to_change Mult answ` = c("Yes (for the environment), because it provided a starting point to collectively do something about energy consumption.;", 
    "Yes (because of the gas crisis), because it provided a starting point to collectively do something. ;"
    ), actually_changed = c("Yes, I tried to use less energy in the office.", 
    "No, not at all."), `control Mult answ` = c("We / I can control the lights.;Closing/opening doors and windows.;", 
    "We / I can control the lights.;Closing/opening doors and windows.;"), `measures_taken Mult answ` = c("Yes, I checked for lights that were not turned off.; Yes, went home early", 
    "Yes, I checked for lights that were not turned off.;")), row.names = c(NA, 
-2L), class = c("data.table", 
"data.frame"))

2. Structure of the data

Some of the columns can have more than one answer. These columns have "Mult answ" in the column name. See for example row 1, column 6 (dat[1,6]).

> dat[1,6]
                                                control Mult answ
1: We / I can control the lights.;Closing/opening doors and windows.;
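
As a minimal illustration of this structure, one such cell can be split into its individual answers in base R. This is only a sketch: the variable names `cell` and `answers` are made up here, and the string is copied from the example data.

``` r
# One cell from a "Mult answ" column (copied from the example data)
cell <- "We / I can control the lights.;Closing/opening doors and windows.;"

# Split the cell on the semicolon separator
answers <- strsplit(cell, ";", fixed = TRUE)[[1]]

# Trim surrounding whitespace and drop any empty pieces
answers <- trimws(answers)
answers <- answers[answers != ""]

print(answers)
```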

3. Question

I would like to write a piece of code that:

- Changes all answers that occur only once to "Other" (because there are many custom answers).
- Creates a separate column for each answer option, with a generic suffix.

4. What I have tried

I thought I would first select the columns that have multiple answers:

library(dplyr)

# Get columns with more than one answer
temp <- select(dat, contains("Mult answ"))
cols_with_more_answers <- names(temp)

I then thought to split the columns on the semicolon (before counting the answers and changing the ones that occur only once to "Other"). But I have multiple columns and never know in advance how many answers there might be.

# Separate columns
tidyr::separate(data.frame(text = dat), text, into = c("A", "B", "C"), sep = ";", fill = "right", extra = "drop")

How should I continue here?

5. Desired output

dat <- structure(list(ID = c(4, 5), 
                       Start_time = structure(c(1676454186, 1676454173), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
                       End_time = structure(c(1676454352, 1676454642), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
                       `want_to_change Mult answ` = c("Other", "Other"), 
                       actually_changed = c("No, not at all.", "Yes, I tried to use less energy in the office."), 
                       `control Mult answ A` = c("We / I can control the lights.", "We / I can control the lights."), 
                       `control Mult answ B` = c("Closing/opening doors and windows", "Closing/opening doors and windows"), 
                       `measures_taken Mult answ A` = c("Yes, I checked for lights that were not turned off.", "Yes, I checked for lights that were not turned off."), 
                       `measures_taken Mult answ B` = c(NA, "Yes, went home early")), 
                  row.names = c(NA, -2L), 
                  class = c("data.table", "data.frame"))

Answer 1

Score: 1

You could do something like this. (Converting the answers to letters, and keeping that stable in case you have more than 26 answers, was a bit tricky, but I found a way around it.)

I left a few comments in the code. In short:

- Pivot the multiple-answer questions into rows and separate the answers with `separate_rows`.
- At that point you can replace the answers that appear only once with "Other" using `forcats::fct_lump_min`.
- Then you can create a new column that converts the answers to letters. For that I had to create the function `values2letters`, which calls `expand_letters`. The first function simply recodes the answers into letters; the second creates the letters. If you have more than 26 answers, single letters wouldn't be enough, so the function makes combinations of letters.
- In the end, you spread the answers over the combination of each question and its corresponding letter to get the expected result.
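
To make the second step concrete, here is a minimal base R sketch of the counting rule it relies on: answers seen fewer than 2 times are replaced by "Other". It is only an approximation of what `forcats::fct_lump_min(value, 2)` does, written without forcats, and the `answers` vector is made up for illustration.

``` r
# Toy answers after splitting on ";" (made-up values for illustration)
answers <- c("lights", "doors", "lights", "went home early", "doors", "custom text")

# Count how often each distinct answer occurs
counts <- table(answers)

# Answers seen fewer than 2 times are treated as one-off custom answers
rare <- names(counts)[counts < 2]

# Replace the rare answers with "Other", keep the rest unchanged
lumped <- ifelse(answers %in% rare, "Other", answers)

print(lumped)
```
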
``` r
library(dplyr)
library(tidyr)

expand_letters <- function(l){
  
  # how many times must the letters repeat?
  x <- ceiling(log(l, 26))
  
  # correct in case of zero
  x <- max(x, 1)

  # repeat the letters
  x <- rep(list(LETTERS), x)
  
  # get combinations
  x <- expand.grid(x)
  
  # collapse letters
  x <- do.call(paste0, rev(x))
  
  # return only the needed ones
  x[seq_len(l)]
  
}

values2letters <- function(x){
  
  # recode each distinct answer (factor level) to a letter label
  x <- factor(x)
  levels <- levels(x)
  l <- length(levels)
  new_levels <- expand_letters(l)
  recode <- setNames(levels, new_levels)
  as.character(forcats::fct_recode(x, !!!recode))
  
}

dat %>%
  
  # pivot only the multi-answer columns
  pivot_longer(ends_with("Mult answ")) %>% 
  
  # separate by ; into multiple rows
  separate_rows(value, sep = ";") %>% 
  
  # remove empty rows (created automatically because each line ends with ;)
  filter(value != "") %>% 

  # change to Other if it appears fewer than 2 times
  mutate(value = as.character(forcats::fct_lump_min(value, 2))) %>%
  
  # recode to letters by question
  group_by(name) %>% 
  mutate(valueLetters = values2letters(value)) %>% 
  ungroup() %>% 
  
  # distinct in case you have multiple "Other"
  distinct() %>%

  # spread values
  pivot_wider(names_from = c(name, valueLetters), values_from = value, names_sep = " ")
#> # A tibble: 2 x 9
#>      ID Start_time          End_time            actual~1 want_~2 contr~3 contr~4
#>   <dbl> <dttm>              <dttm>              <chr>    <chr>   <chr>   <chr>  
#> 1     4 2023-02-15 09:43:06 2023-02-15 09:45:52 Yes, I ~ Other   We / I~ Closin~
#> 2     5 2023-02-15 09:42:53 2023-02-15 09:50:42 No, not~ Other   We / I~ Closin~
#> # ... with 2 more variables: `measures_taken Mult answ B` <chr>,
#> #   `measures_taken Mult answ A` <chr>, and abbreviated variable names
#> #   1: actually_changed, 2: `want_to_change Mult answ A`,
#> #   3: `control Mult answ B`, 4: `control Mult answ A`
```

<sup>Created on 2023-03-20 with reprex v2.0.2</sup>
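
To see what the letter suffixes look like once a question has more than 26 distinct answers, the `expand_letters` helper from the answer can be called on its own. The sketch below repeats that definition in condensed form so it runs standalone in base R; the counts 3 and 28 are arbitrary values chosen for illustration.

``` r
# Same logic as expand_letters() in the answer, repeated so this runs standalone
expand_letters <- function(l){
  # number of letter positions needed to label l answers (at least one)
  x <- max(ceiling(log(l, 26)), 1)
  # all combinations of that many letters, first column varying fastest
  x <- expand.grid(rep(list(LETTERS), x))
  # reverse the columns so the last position varies fastest, then paste
  x <- do.call(paste0, rev(x))
  # keep only as many labels as there are answers
  x[seq_len(l)]
}

few  <- expand_letters(3)   # fits in single letters
many <- expand_letters(28)  # needs two-letter labels

print(few)
print(many[c(1, 26, 27, 28)])
```

Note that as soon as more than 26 labels are needed, every label has two letters (the sequence starts at "AA" rather than continuing after "Z"), so the suffixes within one question all have the same width.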
