2023-06-13 08:09:17

Inconsistent output of data.table using apply(.SD, 1, FUN)

Question

Working with data.table in R, I am trying to join two columns and create a new column in which I have the unique values of the previous step. In the example below you can see that the code works fine with frame z1, but with frame z2 I got an error. However, both tables were created in the same way. The columns have different information, but it should not be a reason for the same code not working on z2.

Thank you so much for your help, and please let me know if I am not clear.

Best,

library(data.table)
z1 <- data.table(a = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2014_HIES_D1_INC_GROUP"),
                 b = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2015_HIES_D1_INC_GROUP"))
z2 <- data.table(a = c("ARG_1980_EPH_D2_INC_GROUP", "ARG_1980_EPH_D2_INC_GROUP"),
                 b = c("ARG_1986_EPH_D2_INC_HIST", "ARG_1986_EPH_D2_INC_HIST"))
z1[,
   cache_id := as.list(apply(.SD, 1, unique)),
   .SDcols = c("a", "b")
]
z1[]
#>                             a                          b
#> 1: ARE_2014_HIES_D1_INC_GROUP ARE_2014_HIES_D1_INC_GROUP
#> 2: ARE_2014_HIES_D1_INC_GROUP ARE_2015_HIES_D1_INC_GROUP
#>                                                 cache_id
#> 1:                            ARE_2014_HIES_D1_INC_GROUP
#> 2: ARE_2014_HIES_D1_INC_GROUP,ARE_2015_HIES_D1_INC_GROUP
z2[,
   cache_id := as.list(apply(.SD, 1, unique)),
   .SDcols = c("a", "b")
]
#> Error in `[.data.table`(z2, , `:=`(cache_id, as.list(apply(.SD, 1, unique))), : Supplied 4 items to be assigned to 2 items of column 'cache_id'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
z2[]
#>                            a                        b
#> 1: ARG_1980_EPH_D2_INC_GROUP ARG_1986_EPH_D2_INC_HIST
#> 2: ARG_1980_EPH_D2_INC_GROUP ARG_1986_EPH_D2_INC_HIST

Created on 2023-06-12 with reprex v2.0.2

Answer 1

Score: 2

apply returns a matrix if every result has the same length (here 2, giving one column per input row); otherwise it returns a list:

apply(z1[, .(a, b)], 1, unique)
[[1]]
[1] "ARE_2014_HIES_D1_INC_GROUP"
[[2]]
[1] "ARE_2014_HIES_D1_INC_GROUP" "ARE_2015_HIES_D1_INC_GROUP"
apply(z2[, .(a, b)], 1, unique)
     [,1]                        [,2]                       
[1,] "ARG_1980_EPH_D2_INC_GROUP" "ARG_1980_EPH_D2_INC_GROUP"
[2,] "ARG_1986_EPH_D2_INC_HIST"  "ARG_1986_EPH_D2_INC_HIST"

Also, as.list on a matrix does not give you a column-wise list; instead, each element of the matrix becomes a separate list element:

as.list(apply(z2[, .(a, b)], 1, unique))
[[1]]
[1] "ARG_1980_EPH_D2_INC_GROUP"
[[2]]
[1] "ARG_1986_EPH_D2_INC_HIST"
[[3]]
[1] "ARG_1980_EPH_D2_INC_GROUP"
[[4]]
[1] "ARG_1986_EPH_D2_INC_HIST"

Hence the length error: 4 items are supplied for a column that has only 2 rows.

I'm not exactly sure what your end result should be, so I can't provide a definite answer.
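If a list with one element per row is what is wanted, one possible sketch is shown below. It uses a small made-up character matrix `m` (an assumption for illustration, not the question's data) to reproduce the simplification in base R, and then the `simplify = FALSE` argument of `apply`, which to my knowledge is available from R 4.1.0 onward:

```r
# Hypothetical 2x2 character matrix; both rows are c("x", "y").
m <- matrix(c("x", "x", "y", "y"), nrow = 2)

# Every row yields two unique values, so apply() simplifies to a matrix.
simplified <- apply(m, 1, unique)

# as.list() on that matrix splits it cell by cell, not row by row.
cellwise <- as.list(simplified)

# simplify = FALSE keeps one list element per row, whatever the lengths.
per_row <- apply(m, 1, unique, simplify = FALSE)

length(cellwise)  # number of matrix cells
length(per_row)   # number of rows
```

With `simplify = FALSE` the result no longer depends on whether the per-row lengths happen to match, which is the property the original `cache_id` assignment relies on.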

Answer 2

Score: 1

You can try the following approach:

z1[, cache_id := list(.(unique(c(a, b)))), by = 1:nrow(z1)]

and similarly for z2

Output:

                            a                          b                                              cache_id
                       <char>                     <char>                                                <list>
1: ARE_2014_HIES_D1_INC_GROUP ARE_2014_HIES_D1_INC_GROUP                            ARE_2014_HIES_D1_INC_GROUP
2: ARE_2014_HIES_D1_INC_GROUP ARE_2015_HIES_D1_INC_GROUP ARE_2014_HIES_D1_INC_GROUP,ARE_2015_HIES_D1_INC_GROUP
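For completeness, here is a hedged sketch of the same call applied to `z2` from the question, the table that triggered the error. Because each row is its own group, the right-hand side is built one row at a time and is never simplified into a matrix:

```r
library(data.table)

# z2 as defined in the question.
z2 <- data.table(a = c("ARG_1980_EPH_D2_INC_GROUP", "ARG_1980_EPH_D2_INC_GROUP"),
                 b = c("ARG_1986_EPH_D2_INC_HIST", "ARG_1986_EPH_D2_INC_HIST"))

# One group per row; each group contributes a single list element
# holding that row's unique values.
z2[, cache_id := list(.(unique(c(a, b)))), by = 1:nrow(z2)]

# Number of unique values stored for each row.
lengths(z2$cache_id)
```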

Answer 3

Score: 1

Another approach, which doesn't need to iterate over rows like @langtang's answer:

z1[, cache_id := lapply(.mapply(c, .SD, NULL), unique), .SDcols = c("a", "b")
   ][, cache_id := sapply(cache_id, paste, collapse = ", ")]
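To make the two steps easier to follow, here is a small base R sketch of what they do. The list `cols` is a made-up stand-in for `.SD` (an assumption for illustration), and `vapply` is swapped in for `sapply` only to fix the return type:

```r
# Hypothetical stand-in for .SD: a list of equal-length columns.
cols <- list(a = c("p", "q"), b = c("p", "r"))

# .mapply() walks the columns in parallel, so each element of `rows`
# is one row's values combined with c().
rows <- .mapply(c, cols, NULL)

# Keep the distinct values of each row.
uniq <- lapply(rows, unique)

# Collapse each row's values into one comma-separated string.
collapsed <- vapply(uniq, paste, character(1), collapse = ", ")
collapsed
```

Note that the second step in the answer turns the list column into a plain character column, so it only fits if a single string per row is acceptable.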

Answer 4

Score: 1

It is unclear why you are doing this, and I suspect we are dealing with an XY problem here. Anyway, you should almost never need to iterate over the rows of a data.table; usually that points to a design issue. If you really need to do it, turn to Rcpp if it isn't a one-off or if your data.table is actually large.

Anyway, in the specific example, you can use data.table's unique method:

library(data.table)
z1 <- data.table(a = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2014_HIES_D1_INC_GROUP"),
                 b = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2015_HIES_D1_INC_GROUP"))
z1[, rn := .I]
unique(melt(z1, "rn"), by = c("rn", "value"))
#   rn variable                      value
#1:  1        a ARE_2014_HIES_D1_INC_GROUP
#2:  2        a ARE_2014_HIES_D1_INC_GROUP
#3:  2        b ARE_2015_HIES_D1_INC_GROUP

If you must, you can then split the value column by rn and add it to the data.table. But again, why would you need that?
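A rough sketch of that last step, under the assumption that a list column named `cache_id` is the desired result (the names `long` and `per_row` are introduced here for illustration only):

```r
library(data.table)

z1 <- data.table(a = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2014_HIES_D1_INC_GROUP"),
                 b = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2015_HIES_D1_INC_GROUP"))
z1[, rn := .I]

# Long format with duplicates within each row removed.
long <- unique(melt(z1, id.vars = "rn"), by = c("rn", "value"))

# One character vector per row number; split() orders groups by rn,
# which matches the row order because rn was set from .I.
per_row <- unname(split(long$value, long$rn))

# Wrapping in list() makes data.table treat per_row as a single list column.
z1[, cache_id := list(per_row)]
lengths(z1$cache_id)
```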

Answer 5

Score: 0

英文:

Thank you all for your answers. They really helped me out, and I learned more about apply and data.table. I selected @langtang's answer because it is the fastest. Still, thank you so much, @hieu-nguyen, for both solutions. I think simplify = FALSE was the key to the problem, but you made that point in a comment, which I can't select as the answer. Please find the benchmark below:

library(data.table)
n <- 1e4
x <- sapply(1:n, \(x) sample(letters, 10) |> paste(collapse = ""))
y <- sapply(1:n, \(x) sample(letters, 10) |> paste(collapse = ""))
ni <- sample(1:n, floor(n/10), replace = FALSE)
x[ni] <- y[ni]
z1 <- data.table(a = x,
                 b = y)
bench <- microbenchmark::microbenchmark(
  times = 30,
  simplify = z1[,
                cache_id := as.list(apply(.SD, 1, unique, simplify = FALSE)),
                .SDcols = c("a", "b")],
  loop_rows = z1[, cache_id:=list(.(unique(c(a,b)))), 1:nrow(z1)],
  mapply    = z1[, cache_id := lapply(.mapply(c, .SD, NULL), unique), .SDcols = c("a", "b")]
)
bench
#> Unit: milliseconds
#>       expr       min        lq     mean   median       uq      max neval cld
#>   simplify 145.03549 171.67624 209.7165 214.4948 244.1717 268.3255    30 a  
#>  loop_rows  80.62317  98.74864 110.2403 106.6774 122.0016 148.4702    30  b 
#>     mapply 337.39212 409.21162 482.0041 478.9344 544.5397 765.9302    30   c

Created on 2023-06-13 with reprex v2.0.2
