复制列名并拼接

huangapple go评论98阅读模式
英文:

Replicate column names and splice

问题

以下是翻译好的部分:

  1. 我有一个如下的数据框:
  2. df <- data.frame(A = c(2, 0, 1), B = c(0, 3, 2))
  3. # A B
  4. # 1 2 0
  5. # 2 0 3
  6. # 3 1 2

每个单元格中的数字表示相应列名应重复的次数。重复部分应该由分号(;)拼接成单个字符串。预期的输出如下:

  1. # A B
  2. # 1 A;A <NA>
  3. # 2 <NA> B;B;B
  4. # 3 A B;B

我正在寻找一种处理更大数据集的高效方法:

  1. set.seed(1234)
  2. df <- as.data.frame(matrix(sample(0:5, 1e4*26, replace = TRUE), 1e4, 26))
  3. names(df) <- LETTERS
  4. # A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
  5. # 1 3 0 3 5 4 0 3 2 2 1 3 3 4 3 2 4 0 1 4 5 2 5 5 2 0 0
  6. # 2 1 5 1 0 3 3 2 0 1 5 5 2 5 0 2 5 1 1 2 4 5 5 0 5 0 0
  7. # 3 5 5 2 0 1 4 5 4 0 5 5 1 1 1 2 2 4 5 4 5 5 5 0 4 0 0
  8. # ...
  9. # [ reached 'max' / getOption("max.print") -- omitted 9997 rows ]

我更喜欢使用basetidyverse的解决方案。欢迎使用data.table,尽管我对它不太熟悉。

  1. <details>
  2. <summary>英文:</summary>
  3. I have a dataframe as follows:
  4. ```r
  5. df &lt;- data.frame(A = c(2, 0, 1), B = c(0, 3, 2))
  6. # A B
  7. # 1 2 0
  8. # 2 0 3
  9. # 3 1 2

The number in each cell indicates the times for which the corresponding column name should repeat. The replicates should be spliced by semicolons(;) to a single string. The expected output turns out to

  1. # A B
  2. # 1 A;A &lt;NA&gt;
  3. # 2 &lt;NA&gt; B;B;B
  4. # 3 A B;B

I'm searching a efficient way to deal with a much larger dataset:

  1. set.seed(1234)
  2. df &lt;- as.data.frame(matrix(sample(0:5, 1e4*26, replace = TRUE), 1e4, 26))
  3. names(df) &lt;- LETTERS
  4. # A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
  5. # 1 3 0 3 5 4 0 3 2 2 1 3 3 4 3 2 4 0 1 4 5 2 5 5 2 0 0
  6. # 2 1 5 1 0 3 3 2 0 1 5 5 2 5 0 2 5 1 1 2 4 5 5 0 5 0 0
  7. # 3 5 5 2 0 1 4 5 4 0 5 5 1 1 1 2 2 4 5 4 5 5 5 0 4 0 0
  8. # ...
  9. # [ reached &#39;max&#39; / getOption(&quot;max.print&quot;) -- omitted 9997 rows ]

I prefer base or tidyverse solutions. data.table is welcome but I'm unfamiliar to it though.

答案1

得分: 3

以下是已翻译的代码部分:

在R中的基本选项:

  1. df[] <- mapply(\(x, y) sapply(x, \(z) ifelse(z == 0, NA, paste(rep(y, z), collapse = ";"))),
  2. df, names(df))
  3. # A B
  4. # 1 A;A <NA>
  5. # 2 <NA> B;B;B
  6. # 3 A B;B

转换为tidyverse

  1. library(purrr)
  2. library(dplyr)
  3. df %>%
  4. imap_dfr(\(x, y) sapply(x, \(z) ifelse(z == 0, NA, paste(rep(y, z), collapse = ";"))))

另一个dplyr选项(但可能较慢):

  1. df %>%
  2. rowwise() %>%
  3. mutate(across(everything(), ~ ifelse(.x == 0, NA, paste(rep(rep(cur_column(), n()), .x), collapse = ";"))))

基准测试结果:

  1. library(microbenchmark)
  2. mb <- microbenchmark(
  3. darren = {res <- df
  4. n <- unlist(res)
  5. res[res > 0] <- sapply(
  6. split(rep.int(names(df)[col(df)], n), rep.int(seq_along(n), n)),
  7. paste, collapse = ';'
  8. )
  9. res[res == 0] <- NA},
  10. mael = mapply(\(x, y) sapply(x, \(z) ifelse(z == 0, NA, paste(rep(y, z), collapse = ";"))),
  11. df, names(df)),
  12. setup = {
  13. set.seed(1234)
  14. df <- as.data.frame(matrix(sample(0:5, 1e4*26, replace = TRUE), 1e4, 26))
  15. names(df) <- LETTERS
  16. },
  17. times = 10
  18. )
  19. #Unit: seconds
  20. # expr min lq mean median uq max neval
  21. # darren 1.017067 1.042890 1.116436 1.105676 1.203968 1.227238 10
  22. # mael 1.426412 1.518794 1.576665 1.581085 1.628206 1.713553 10
英文:

Here is a base R option:

  1. df[] &lt;- mapply(\(x, y) sapply(x, \(z) ifelse(z == 0, NA, paste(rep(y, z), collapse = &quot;;&quot;))),
  2. df, names(df))
  3. # A B
  4. # 1 A;A &lt;NA&gt;
  5. # 2 &lt;NA&gt; B;B;B
  6. # 3 A B;B

And converted to tidyverse:

  1. library(purrr)
  2. library(dplyr)
  3. df %&gt;%
  4. imap_dfr(\(x, y) sapply(x, \(z) ifelse(z == 0, NA, paste(rep(y, z), collapse = &quot;;&quot;))))

Another dplyr option (but probably slow):

  1. df %&gt;%
  2. rowwise() %&gt;%
  3. mutate(across(everything(), ~ ifelse(.x == 0, NA, paste(rep(rep(cur_column(), n()), .x), collapse = &quot;;&quot;))))

Benchmark:

  1. library(microbenchmark)
  2. mb &lt;- microbenchmark(
  3. darren = {res &lt;- df
  4. n &lt;- unlist(res)
  5. res[res &gt; 0] &lt;- sapply(
  6. split(rep.int(names(df)[col(df)], n), rep.int(seq_along(n), n)),
  7. paste, collapse = &#39;;&#39;
  8. )
  9. res[res == 0] &lt;- NA},
  10. mael = mapply(\(x, y) sapply(x, \(z) ifelse(z == 0, NA, paste(rep(y, z), collapse = &quot;;&quot;))),
  11. df, names(df)),
  12. setup = {
  13. set.seed(1234)
  14. df &lt;- as.data.frame(matrix(sample(0:5, 1e4*26, replace = TRUE), 1e4, 26))
  15. names(df) &lt;- LETTERS
  16. },
  17. times = 10
  18. )
  19. #Unit: seconds
  20. # expr min lq mean median uq max neval
  21. # darren 1.017067 1.042890 1.116436 1.105676 1.203968 1.227238 10
  22. # mael 1.426412 1.518794 1.576665 1.581085 1.628206 1.713553 10
  23. </details>
  24. # 答案2
  25. **得分**: 2
  26. 一种 `base` 选项:
  27. ```r
  28. res <- df
  29. n <- unlist(res)
  30. res[res > 0] <- sapply(
  31. split(rep.int(names(df)[col(df)], n), rep.int(seq_along(n), n)),
  32. paste, collapse = ';'
  33. )
  34. res[res == 0] <- NA
  35. res[1:5, 1:5]
  36. # A B C D E
  37. # 1 A;A;A <NA> C;C;C D;D;D;D;D E;E;E;E
  38. # 2 A B;B;B;B;B C <NA> E;E;E
  39. # 3 A;A;A;A;A B;B;B;B;B C;C <NA> E
  40. # 4 A;A;A;A B;B;B;B <NA> D E;E;E;E
  41. # 5 A;A;A B C;C;C;C D;D E;E;E
英文:

One base option:

  1. res &lt;- df
  2. n &lt;- unlist(res)
  3. res[res &gt; 0] &lt;- sapply(
  4. split(rep.int(names(df)[col(df)], n), rep.int(seq_along(n), n)),
  5. paste, collapse = &#39;;&#39;
  6. )
  7. res[res == 0] &lt;- NA
  8. res[1:5, 1:5]
  9. # A B C D E
  10. # 1 A;A;A &lt;NA&gt; C;C;C D;D;D;D;D E;E;E;E
  11. # 2 A B;B;B;B;B C &lt;NA&gt; E;E;E
  12. # 3 A;A;A;A;A B;B;B;B;B C;C &lt;NA&gt; E
  13. # 4 A;A;A;A B;B;B;B &lt;NA&gt; D E;E;E;E
  14. # 5 A;A;A B C;C;C;C D;D E;E;E

huangapple
  • 本文由 发表于 2023年7月20日 20:37:25
  • 转载请务必保留本文链接:https://go.coder-hub.com/76729966.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定