复制列名并拼接

huangapple go评论60阅读模式
英文:

Replicate column names and splice

问题

以下是翻译好的部分:

我有一个如下的数据框:

df <- data.frame(A = c(2, 0, 1), B = c(0, 3, 2))

#   A B
# 1 2 0
# 2 0 3
# 3 1 2

每个单元格中的数字表示相应列名应重复的次数。重复部分应该由分号(;)拼接成单个字符串。预期的输出如下:

#      A     B
# 1  A;A  <NA>
# 2 <NA> B;B;B
# 3    A   B;B

我正在寻找一种处理更大数据集的高效方法:

set.seed(1234)
df <- as.data.frame(matrix(sample(0:5, 1e4*26, replace = TRUE), 1e4, 26))
names(df) <- LETTERS

#    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 1  3 0 3 5 4 0 3 2 2 1 3 3 4 3 2 4 0 1 4 5 2 5 5 2 0 0
# 2  1 5 1 0 3 3 2 0 1 5 5 2 5 0 2 5 1 1 2 4 5 5 0 5 0 0
# 3  5 5 2 0 1 4 5 4 0 5 5 1 1 1 2 2 4 5 4 5 5 5 0 4 0 0
# ...
# [ reached 'max' / getOption("max.print") -- omitted 9997 rows ]

我更喜欢使用basetidyverse的解决方案。欢迎使用data.table,尽管我对它不太熟悉。


<details>
<summary>英文:</summary>

I have a dataframe as follows:

```r
df &lt;- data.frame(A = c(2, 0, 1), B = c(0, 3, 2))

#   A B
# 1 2 0
# 2 0 3
# 3 1 2

The number in each cell indicates the times for which the corresponding column name should repeat. The replicates should be spliced by semicolons(;) to a single string. The expected output turns out to

#      A     B
# 1  A;A  &lt;NA&gt;
# 2 &lt;NA&gt; B;B;B
# 3    A   B;B

I'm searching a efficient way to deal with a much larger dataset:

set.seed(1234)
df &lt;- as.data.frame(matrix(sample(0:5, 1e4*26, replace = TRUE), 1e4, 26))
names(df) &lt;- LETTERS

#    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 1  3 0 3 5 4 0 3 2 2 1 3 3 4 3 2 4 0 1 4 5 2 5 5 2 0 0
# 2  1 5 1 0 3 3 2 0 1 5 5 2 5 0 2 5 1 1 2 4 5 5 0 5 0 0
# 3  5 5 2 0 1 4 5 4 0 5 5 1 1 1 2 2 4 5 4 5 5 5 0 4 0 0
# ...
# [ reached &#39;max&#39; / getOption(&quot;max.print&quot;) -- omitted 9997 rows ]

I prefer base or tidyverse solutions. data.table is welcome but I'm unfamiliar to it though.

答案1

得分: 3

以下是已翻译的代码部分:

在R中的基本选项:

df[] <- mapply(\(x, y) sapply(x, \(z) ifelse(z == 0, NA, paste(rep(y, z), collapse = ";"))),
       df, names(df))

#      A     B
# 1  A;A  <NA>
# 2 <NA> B;B;B
# 3    A   B;B

转换为tidyverse

library(purrr)
library(dplyr)
df %>%
  imap_dfr(\(x, y) sapply(x, \(z) ifelse(z == 0, NA, paste(rep(y, z), collapse = ";"))))

另一个dplyr选项(但可能较慢):

df %>%
  rowwise() %>%
  mutate(across(everything(), ~ ifelse(.x == 0, NA, paste(rep(rep(cur_column(), n()), .x), collapse = ";"))))

基准测试结果:

library(microbenchmark)
mb <- microbenchmark(
  darren = {res <- df
  n <- unlist(res)
  res[res > 0] <- sapply(
    split(rep.int(names(df)[col(df)], n), rep.int(seq_along(n), n)),
    paste, collapse = ';'
  )
  res[res == 0] <- NA},
  mael = mapply(\(x, y) sapply(x, \(z) ifelse(z == 0, NA, paste(rep(y, z), collapse = ";"))),
                df, names(df)),
  setup = {
    set.seed(1234)
    df <- as.data.frame(matrix(sample(0:5, 1e4*26, replace = TRUE), 1e4, 26))
    names(df) <- LETTERS
  },
  times = 10
)

#Unit: seconds
#   expr      min       lq     mean   median       uq      max neval
# darren 1.017067 1.042890 1.116436 1.105676 1.203968 1.227238    10
#   mael 1.426412 1.518794 1.576665 1.581085 1.628206 1.713553    10
英文:

Here is a base R option:

df[] &lt;- mapply(\(x, y) sapply(x, \(z) ifelse(z == 0, NA, paste(rep(y, z), collapse = &quot;;&quot;))),
       df, names(df))

#      A     B
# 1  A;A  &lt;NA&gt;
# 2 &lt;NA&gt; B;B;B
# 3    A   B;B

And converted to tidyverse:

library(purrr)
library(dplyr)
df %&gt;% 
 imap_dfr(\(x, y) sapply(x, \(z) ifelse(z == 0, NA, paste(rep(y, z), collapse = &quot;;&quot;))))

Another dplyr option (but probably slow):

df %&gt;% 
  rowwise() %&gt;% 
  mutate(across(everything(), ~ ifelse(.x == 0, NA, paste(rep(rep(cur_column(), n()), .x), collapse = &quot;;&quot;))))

Benchmark:

library(microbenchmark)
mb &lt;- microbenchmark(
  darren = {res &lt;- df
  n &lt;- unlist(res)
  res[res &gt; 0] &lt;- sapply(
    split(rep.int(names(df)[col(df)], n), rep.int(seq_along(n), n)),
    paste, collapse = &#39;;&#39;
  )
  res[res == 0] &lt;- NA},
  mael = mapply(\(x, y) sapply(x, \(z) ifelse(z == 0, NA, paste(rep(y, z), collapse = &quot;;&quot;))),
                df, names(df)),
  setup = {
    set.seed(1234)
    df &lt;- as.data.frame(matrix(sample(0:5, 1e4*26, replace = TRUE), 1e4, 26))
    names(df) &lt;- LETTERS
  },
  times = 10
)

#Unit: seconds
#   expr      min       lq     mean   median       uq      max neval
# darren 1.017067 1.042890 1.116436 1.105676 1.203968 1.227238    10
#   mael 1.426412 1.518794 1.576665 1.581085 1.628206 1.713553    10

</details>



# 答案2
**得分**: 2

一种 `base` 选项:

```r
res <- df
n <- unlist(res)
res[res > 0] <- sapply(
  split(rep.int(names(df)[col(df)], n), rep.int(seq_along(n), n)),
  paste, collapse = ';'
)
res[res == 0] <- NA

res[1:5, 1:5]
#           A         B       C         D       E
# 1     A;A;A      <NA>   C;C;C D;D;D;D;D E;E;E;E
# 2         A B;B;B;B;B       C      <NA>   E;E;E
# 3 A;A;A;A;A B;B;B;B;B     C;C      <NA>       E
# 4   A;A;A;A   B;B;B;B    <NA>         D E;E;E;E
# 5     A;A;A         B C;C;C;C       D;D   E;E;E
英文:

One base option:

res &lt;- df
n &lt;- unlist(res)
res[res &gt; 0] &lt;- sapply(
  split(rep.int(names(df)[col(df)], n), rep.int(seq_along(n), n)),
  paste, collapse = &#39;;&#39;
)
res[res == 0] &lt;- NA

res[1:5, 1:5]
#           A         B       C         D       E
# 1     A;A;A      &lt;NA&gt;   C;C;C D;D;D;D;D E;E;E;E
# 2         A B;B;B;B;B       C      &lt;NA&gt;   E;E;E
# 3 A;A;A;A;A B;B;B;B;B     C;C      &lt;NA&gt;       E
# 4   A;A;A;A   B;B;B;B    &lt;NA&gt;         D E;E;E;E
# 5     A;A;A         B C;C;C;C       D;D   E;E;E

huangapple
  • 本文由 发表于 2023年7月20日 20:37:25
  • 转载请务必保留本文链接:https://go.coder-hub.com/76729966.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定