Using purrr to recode across multiple columns with multiple mappings

Question

I have a data frame with questionnaire response labels. I like to build a tibble of item-answer definitions and then use dplyr::recode() to replace all item labels with their corresponding definitions. For ease of use, the definitions tibble recode_df stores these correspondences as strings; inside dplyr::recode() they can be evaluated and unpacked with the splice operator !!! (bang-bang-bang). The following toy example has four items, two for qa and two for qb, and items with the same prefix share the same answer definitions.

library(tidyverse)
set.seed(42)

# columns starting with `qa` and `qb` share the same answer structure 
data_df <- tibble(
  qa_1 = sample(c(0, 1), 5, replace = TRUE),
  qa_2 = sample(c(0, 1), 5, replace = TRUE),
  qb_1 = sample(1:5, 5, replace = TRUE),
  qb_3 = sample(1:5, 5, replace = TRUE)
)

# `answer` column stores string definitions for use with `dplyr::recode()`
recode_df <- tibble(
  question = c("qa", "qb"),
  answer = c(
    'c("0" = "foo0", "1" = "foo1")',
    'c("1" = "bar1", "2" = "bar2", "3" = "bar3", "4" = "bar5", "5" = "bar5")'
  )
)

# Desired result
data_df %>%
  mutate(
    across(
      .cols = starts_with("qa"),
      .fns = ~recode(., !!!eval(parse(text = recode_df$answer[str_detect(recode_df$question, "qa")])))
    ),
    across(
      .cols = starts_with("qb"),
      .fns = ~recode(., !!!eval(parse(text = recode_df$answer[str_detect(recode_df$question, "qb")])))
    )
  )
#> # A tibble: 5 x 4
#>   qa_1  qa_2  qb_1  qb_3 
#>   <chr> <chr> <chr> <chr>
#&gt; 1 foo0  foo1  bar5  bar2 
#&gt; 2 foo0  foo1  bar1  bar3 
#&gt; 3 foo0  foo1  bar5  bar1 
#&gt; 4 foo0  foo0  bar5  bar1 
#&gt; 5 foo1  foo1  bar2  bar3

Created on 2023-02-26 with reprex v2.0.2

I can reach my desired result by writing one across() call inside mutate() for each row of recode_df, but I am sure there is an elegant purrr solution that iterates and recodes without repeating code. Thank you.
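As a minimal sketch of the mechanism described in the question, the snippet below parses one stored definition string into a named vector and splices it into dplyr::recode(). The names answer_string, mapping, responses, and labelled are illustrative and do not appear in the original post.

```r
library(dplyr)

# One definition stored as a string, as in the `answer` column of recode_df.
answer_string <- 'c("0" = "foo0", "1" = "foo1")'

# parse() turns the string into an expression; eval() runs it and
# returns the named character vector.
mapping <- eval(parse(text = answer_string))

# Hypothetical responses for a single `qa` item.
responses <- c(0, 1, 1)

# `!!!` splices the named vector, so this call is equivalent to
# recode(responses, "0" = "foo0", "1" = "foo1").
labelled <- recode(responses, !!!mapping)
labelled
```

Because eval(parse()) executes whatever the string contains, this pattern is only reasonable when the definition strings come from a trusted source.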

Answer 1

Score: 2

There are a number of alternatives to consider, especially if you store your answer key in a different form. However, given the present data frames, you could try the following. map_dfc() column-binds the results, so you can apply the recoding function to each element of a character vector of prefixes, such as "qa" and "qb". Let me know if this helps.

library(tidyverse)

map_dfc(
  recode_df$question,
  \(x) {
    map(
      select(data_df, contains(x)),
      \(y) recode(y, !!!eval(parse(text = recode_df$answer[str_detect(recode_df$question, x)])))
    )
  }
)

Output

  qa_1  qa_2  qb_1  qb_3 
  <chr> <chr> <chr> <chr>
1 foo0  foo1  bar5  bar2 
2 foo0  foo1  bar1  bar3 
3 foo0  foo1  bar5  bar1 
4 foo0  foo0  bar5  bar1 
5 foo1  foo1  bar2  bar3 
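One possible "different form" for the answer key is sketched below. It assumes the key is kept as a named list of named character vectors, which avoids eval(parse()), and it folds over the prefixes with purrr::reduce(), so columns that match no prefix stay in the data frame. The names recode_key, recode_with, and recoded, and the small fixed data set, are hypothetical and not taken from the answer above.

```r
library(dplyr)
library(purrr)
library(tibble)

# Small fixed data set (hypothetical values) so the result is easy to check.
data_df <- tibble(
  qa_1 = c(0, 0, 1),
  qa_2 = c(1, 1, 0),
  qb_1 = c(1L, 5L, 2L),
  qb_3 = c(2L, 3L, 1L)
)

# Assumption: the answer key is a named list of named character vectors,
# one entry per column prefix, instead of strings that must be parsed.
recode_key <- list(
  qa = c("0" = "foo0", "1" = "foo1"),
  qb = c("1" = "bar1", "2" = "bar2", "3" = "bar3", "4" = "bar5", "5" = "bar5")
)

# Helper that splices one mapping into recode().
recode_with <- function(col, key) recode(col, !!!key)

# Fold over the prefixes: each step recodes the columns for one prefix
# and passes the updated data frame on to the next step.
recoded <- reduce(
  names(recode_key),
  \(df, prefix) mutate(
    df,
    across(starts_with(prefix), \(col) recode_with(col, recode_key[[prefix]]))
  ),
  .init = data_df
)

recoded
```

Starting the fold from the full data frame via .init keeps the original column order, whereas the map_dfc() version returns only the columns selected by contains().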

Answer 2

Score: 1

You can get that more cheaply.

data_df[] <- lapply(names(data_df), \(x) if (grepl('qa', x)) paste0('foo', data_df[[x]]) else paste0('bar', data_df[[x]]))

If there are many more columns, you can use a simple dictionary dc: a named vector with the label prefixes as elements and the column prefixes as names.

dc <- c(qa='foo', qb='bar')
## alternatively using `gsub` to derive the column prefixes
# dc <- setNames(c('foo', 'bar'), unique(gsub('_\\d+$', '', names(data_df))))

We can now pass each column name and the names of dc to startsWith() to identify the correct entry in dc.

data_df[] <- lapply(names(data_df), \(x) paste0(dc[startsWith(x, names(dc))], data_df[[x]]))

data_df
#   qa_1 qa_2 qb_1 qb_3
# 1 foo0 foo1 bar4 bar2
# 2 foo0 foo1 bar1 bar3
# 3 foo0 foo1 bar5 bar1
# 4 foo0 foo0 bar4 bar1
# 5 foo1 foo1 bar2 bar3
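The lookup inside the lapply() call can be traced for a single column, as a rough illustration. The names col_name, hit, and labels, and the two example values, are made up for this sketch. Note that this approach pastes the prefix onto the raw value, so it yields bar4 for a 4, while the mapping in the question sends "4" to "bar5".

```r
# Dictionary from column prefix to label prefix, as in the answer.
dc <- c(qa = "foo", qb = "bar")

# Hypothetical column name to look up.
col_name <- "qb_3"

# startsWith() recycles the column name against every prefix in names(dc)
# and returns one logical per dictionary entry.
hit <- startsWith(col_name, names(dc))

# Logical indexing picks the matching label prefix, and paste0() glues it
# onto the raw values (paste0() drops the name carried by dc[hit]).
labels <- paste0(dc[hit], c(2, 3))
labels
```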

This should also work well with hundreds of columns. It is hard to avoid defining the translation at least once.

huangapple
  • Posted on 2023-02-26 19:31:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75571667.html