2023年7月20日 20:37:25go评论98阅读模式

英文:

Replicate column names and splice

问题

以下是翻译好的部分：

我有一个如下的数据框：
df <- data.frame(A = c(2, 0, 1), B = c(0, 3, 2))
#   A B
# 1 2 0
# 2 0 3
# 3 1 2

每个单元格中的数字表示相应列名应重复的次数。重复部分应该由分号(;)拼接成单个字符串。预期的输出如下：

#      A     B
# 1  A;A  <NA>
# 2 <NA> B;B;B
# 3    A   B;B

我正在寻找一种处理更大数据集的高效方法：

set.seed(1234)
df <- as.data.frame(matrix(sample(0:5, 1e4*26, replace = TRUE), 1e4, 26))
names(df) <- LETTERS
#    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 1  3 0 3 5 4 0 3 2 2 1 3 3 4 3 2 4 0 1 4 5 2 5 5 2 0 0
# 2  1 5 1 0 3 3 2 0 1 5 5 2 5 0 2 5 1 1 2 4 5 5 0 5 0 0
# 3  5 5 2 0 1 4 5 4 0 5 5 1 1 1 2 2 4 5 4 5 5 5 0 4 0 0
# ...
# [ reached 'max' / getOption("max.print") -- omitted 9997 rows ]

我更喜欢使用base或tidyverse的解决方案。欢迎使用data.table，尽管我对它不太熟悉。


<details>
<summary>英文:</summary>
I have a dataframe as follows:
```r
df &lt;- data.frame(A = c(2, 0, 1), B = c(0, 3, 2))
#   A B
# 1 2 0
# 2 0 3
# 3 1 2

The number in each cell indicates the times for which the corresponding column name should repeat. The replicates should be spliced by semicolons(;) to a single string. The expected output turns out to

#      A     B
# 1  A;A  &lt;NA&gt;
# 2 &lt;NA&gt; B;B;B
# 3    A   B;B

I'm searching a efficient way to deal with a much larger dataset:

set.seed(1234)
df &lt;- as.data.frame(matrix(sample(0:5, 1e4*26, replace = TRUE), 1e4, 26))
names(df) &lt;- LETTERS
#    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 1  3 0 3 5 4 0 3 2 2 1 3 3 4 3 2 4 0 1 4 5 2 5 5 2 0 0
# 2  1 5 1 0 3 3 2 0 1 5 5 2 5 0 2 5 1 1 2 4 5 5 0 5 0 0
# 3  5 5 2 0 1 4 5 4 0 5 5 1 1 1 2 2 4 5 4 5 5 5 0 4 0 0
# ...
# [ reached &#39;max&#39; / getOption(&quot;max.print&quot;) -- omitted 9997 rows ]

I prefer base or tidyverse solutions. data.table is welcome but I'm unfamiliar to it though.

答案1

得分: 3

以下是已翻译的代码部分：

在R中的基本选项：

df[] <- mapply(\(x, y) sapply(x, \(z) ifelse(z == 0, NA, paste(rep(y, z), collapse = ";"))),
       df, names(df))
#      A     B
# 1  A;A  <NA>
# 2 <NA> B;B;B
# 3    A   B;B

转换为tidyverse：

library(purrr)
library(dplyr)
df %>%
  imap_dfr(\(x, y) sapply(x, \(z) ifelse(z == 0, NA, paste(rep(y, z), collapse = ";"))))

另一个dplyr选项（但可能较慢）：

df %>%
  rowwise() %>%
  mutate(across(everything(), ~ ifelse(.x == 0, NA, paste(rep(rep(cur_column(), n()), .x), collapse = ";"))))

基准测试结果：

library(microbenchmark)
mb <- microbenchmark(
  darren = {res <- df
  n <- unlist(res)
  res[res > 0] <- sapply(
    split(rep.int(names(df)[col(df)], n), rep.int(seq_along(n), n)),
    paste, collapse = ';'
  )
  res[res == 0] <- NA},
  mael = mapply(\(x, y) sapply(x, \(z) ifelse(z == 0, NA, paste(rep(y, z), collapse = ";"))),
                df, names(df)),
  setup = {
    set.seed(1234)
    df <- as.data.frame(matrix(sample(0:5, 1e4*26, replace = TRUE), 1e4, 26))
    names(df) <- LETTERS
  },
  times = 10
)
#Unit: seconds
#   expr      min       lq     mean   median       uq      max neval
# darren 1.017067 1.042890 1.116436 1.105676 1.203968 1.227238    10
#   mael 1.426412 1.518794 1.576665 1.581085 1.628206 1.713553    10

英文:

Here is a base R option:

df[] &lt;- mapply(\(x, y) sapply(x, \(z) ifelse(z == 0, NA, paste(rep(y, z), collapse = &quot;;&quot;))),
       df, names(df))
#      A     B
# 1  A;A  &lt;NA&gt;
# 2 &lt;NA&gt; B;B;B
# 3    A   B;B

And converted to tidyverse:

library(purrr)
library(dplyr)
df %&gt;% 
 imap_dfr(\(x, y) sapply(x, \(z) ifelse(z == 0, NA, paste(rep(y, z), collapse = &quot;;&quot;))))

Another dplyr option (but probably slow):

df %&gt;% 
  rowwise() %&gt;% 
  mutate(across(everything(), ~ ifelse(.x == 0, NA, paste(rep(rep(cur_column(), n()), .x), collapse = &quot;;&quot;))))

Benchmark:

library(microbenchmark)
mb &lt;- microbenchmark(
  darren = {res &lt;- df
  n &lt;- unlist(res)
  res[res &gt; 0] &lt;- sapply(
    split(rep.int(names(df)[col(df)], n), rep.int(seq_along(n), n)),
    paste, collapse = &#39;;&#39;
  )
  res[res == 0] &lt;- NA},
  mael = mapply(\(x, y) sapply(x, \(z) ifelse(z == 0, NA, paste(rep(y, z), collapse = &quot;;&quot;))),
                df, names(df)),
  setup = {
    set.seed(1234)
    df &lt;- as.data.frame(matrix(sample(0:5, 1e4*26, replace = TRUE), 1e4, 26))
    names(df) &lt;- LETTERS
  },
  times = 10
)
#Unit: seconds
#   expr      min       lq     mean   median       uq      max neval
# darren 1.017067 1.042890 1.116436 1.105676 1.203968 1.227238    10
#   mael 1.426412 1.518794 1.576665 1.581085 1.628206 1.713553    10
</details>
# 答案2
**得分**: 2
一种 `base` 选项：
```r
res <- df
n <- unlist(res)
res[res > 0] <- sapply(
  split(rep.int(names(df)[col(df)], n), rep.int(seq_along(n), n)),
  paste, collapse = ';'
)
res[res == 0] <- NA
res[1:5, 1:5]
#           A         B       C         D       E
# 1     A;A;A      <NA>   C;C;C D;D;D;D;D E;E;E;E
# 2         A B;B;B;B;B       C      <NA>   E;E;E
# 3 A;A;A;A;A B;B;B;B;B     C;C      <NA>       E
# 4   A;A;A;A   B;B;B;B    <NA>         D E;E;E;E
# 5     A;A;A         B C;C;C;C       D;D   E;E;E

英文:

One base option:

res &lt;- df
n &lt;- unlist(res)
res[res &gt; 0] &lt;- sapply(
  split(rep.int(names(df)[col(df)], n), rep.int(seq_along(n), n)),
  paste, collapse = &#39;;&#39;
)
res[res == 0] &lt;- NA
res[1:5, 1:5]
#           A         B       C         D       E
# 1     A;A;A      &lt;NA&gt;   C;C;C D;D;D;D;D E;E;E;E
# 2         A B;B;B;B;B       C      &lt;NA&gt;   E;E;E
# 3 A;A;A;A;A B;B;B;B;B     C;C      &lt;NA&gt;       E
# 4   A;A;A;A   B;B;B;B    &lt;NA&gt;         D E;E;E;E
# 5     A;A;A         B C;C;C;C       D;D   E;E;E

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

复制列名并拼接

问题

答案1

Python pandas: 如何在两个数据框之间匹配数据

如何将另一列的值初始化后添加到数据框中的对象列？

随机抽样数据，基于其他列使用Python。

计算嵌套/分割对象上的基尼指数

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。