Expanding values from multiple columns into binary values

Question

I have several columns that each hold either a value (letters in the example below) or NA. I want to reshape them so that every distinct value becomes its own column, filled with a binary 0 or 1 indicating whether that value appears in the row.

The actual dataset contains many other variables, so the code needs to restrict the operation to the 'stress' variables only.

Example dataset:

    df <- data.frame(
      stress1 = c('A', 'A', 'B', 'A'),
      stress2 = c(NA, 'B', 'C', 'B'),
      stress3 = c(NA, NA, NA, 'C'),
      stress4 = c(NA, NA, NA, 'D')
    )

Desired outcome:

    desiredoutcome <- data.frame(
      A = c(1, 1, 0, 1),
      B = c(0, 1, 1, 1),
      C = c(0, 0, 1, 1),
      D = c(0, 0, 0, 1)
    )

Answer 1

Score: 2

With tidyr + dplyr:

    library(tidyr)
    library(dplyr)

    df %>%
      mutate(id = row_number()) %>%                  # row identifier
      pivot_longer(-id, values_drop_na = TRUE) %>%   # one row per non-NA cell
      pivot_wider(names_from = "value", values_from = "name",
                  values_fill = 0, values_fn = length)
    #   id A B C D
    # 1  1 1 0 0 0
    # 2  2 1 1 0 0
    # 3  3 0 1 1 0
    # 4  4 1 1 1 1

Or in base R:

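    # Row identifier. seq_len(nrow(df)) is used instead of seq_along(df):
    # seq_along(df) counts columns and only matches here because df is 4 x 4.
    df$ID <- seq_len(nrow(df))
    # df[1:4] are the four stress columns; ID is recycled across them
    table(cbind(df['ID'], unlist(df[1:4]))) |>
      as.data.frame.matrix()
    #   A B C D
    # 1 1 0 0 0
    # 2 1 1 0 0
    # 3 0 1 1 0
    # 4 1 1 1 1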
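
Since the real data contains other variables, the tabulation can be limited to the stress columns by selecting them by name rather than by position. A minimal base R sketch, assuming the columns of interest all share the `stress` prefix; the `other` column and the names `stress_cols`, `long`, `ind`, and `result` are made up for illustration:

```r
# Hypothetical data: one unrelated column plus the four stress columns
df <- data.frame(
  other   = c(10, 20, 30, 40),
  stress1 = c("A", "A", "B", "A"),
  stress2 = c(NA, "B", "C", "B"),
  stress3 = c(NA, NA, NA, "C"),
  stress4 = c(NA, NA, NA, "D")
)

# Pick the stress columns by name (assumes they share the "stress" prefix)
stress_cols <- grep("^stress", names(df), value = TRUE)

# Stack the stress columns into long form: one (row id, value) pair per cell
long <- data.frame(
  id    = rep(seq_len(nrow(df)), times = length(stress_cols)),
  value = unlist(df[stress_cols], use.names = FALSE)
)

# Cross-tabulate row id against value; table() drops NA values by default
ind <- as.data.frame.matrix(table(long$id, long$value))

# Attach the indicator columns to the untouched variables
result <- cbind(df[setdiff(names(df), stress_cols)], ind)
```

Here `result` keeps `other` and gains one count column per distinct value. A row whose stress columns are all NA still gets a row of zeros, because every row id is a level of the table.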

Answer 2

Score: 2

A base R approach, generalised so that it should work on real data containing any number of distinct strings:

    # updated sample data frame - NAs should not be quoted, removed trailing comma
    df <- data.frame(
      stress1 = c('A', 'A', 'B', 'A'),
      stress2 = c(NA, 'B', 'C', 'B'),
      stress3 = c(NA, NA, NA, 'C'),
      stress4 = c(NA, NA, NA, 'D')
    )

    desiredout <- data.frame(
      A = c(1, 1, 0, 1),
      B = c(0, 1, 1, 1),
      C = c(0, 0, 1, 1),
      D = c(0, 0, 0, 1)
    )

    out <- data.frame(              # for data frame output
      lapply(                       # iterate over each distinct value
        unique(
          unlist(df[!is.na(df)])    # all unique, non-NA values in df
        ),
        \(x) {
          rowSums(df == x, na.rm = TRUE)   # per-row count of value x
        }
      ))

    names(out) <- unique(unlist(df[!is.na(df)]))

Check:

    > all.equal(desiredout, out)
    [1] TRUE

And:

    > out
      A B C D
    1 1 0 0 0
    2 1 1 0 0
    3 0 1 1 0
    4 1 1 1 1

If you see values other than 0 or 1 in the output columns, some rows probably contain the same string more than once. If so, come back and we can edit the code accordingly.
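
If a string can occur more than once in a row, one option is to collapse the counts to a 0/1 indicator by testing whether each count is positive. A minimal sketch that reuses the `rowSums()` idea on made-up data in which row 1 contains 'A' twice; `vals` and `out01` are illustrative names, not part of the answer's code:

```r
# Hypothetical data: row 1 holds "A" twice
df <- data.frame(
  stress1 = c("A", "A", "B"),
  stress2 = c("A", "B", "C"),
  stress3 = c(NA, NA, "D")
)

# All distinct values across the columns; sort() drops NA by default
vals <- sort(unique(unlist(df, use.names = FALSE)))

# For each value, flag rows where it occurs at least once (1) or never (0)
out01 <- as.data.frame(
  lapply(setNames(vals, vals), function(x) {
    as.integer(rowSums(df == x, na.rm = TRUE) > 0)
  })
)
```

With this variant row 1 gets `A = 1` rather than `A = 2`. Note that `as.data.frame()` makes column names syntactically valid, so values that are not valid R names would be renamed.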

Answer 3

Score: 2

A purrr solution with pmap_dfr + table:

    library(purrr)

    # tabulate each row, bind the results, then fill missing values with 0
    pmap_dfr(df, ~ unclass(table(c(...)))) %>%
      replace(is.na(.), 0)
    # # A tibble: 4 × 4
    #       A     B     C     D
    #   <int> <int> <int> <int>
    # 1     1     0     0     0
    # 2     1     1     0     0
    # 3     0     1     1     0
    # 4     1     1     1     1

huangapple
  • Posted on July 17, 2023, 19:29:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76704015.html