Inconsistent output of data.table using apply(.SD, 1, FUN)

Question

Working with data.table in R, I am trying to combine two columns and create a new column that holds the unique values of each row. In the example below you can see that the code works fine with table z1, but with table z2 I get an error. Both tables were created in the same way. The columns hold different information, but that should not be a reason for the same code to fail on z2.

Thank you so much for your help, and please let me know if I am not clear.

Best,

library(data.table)


z1 <- data.table(a = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2014_HIES_D1_INC_GROUP"), 
                 b = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2015_HIES_D1_INC_GROUP"))


z2 <- data.table(a = c("ARG_1980_EPH_D2_INC_GROUP", "ARG_1980_EPH_D2_INC_GROUP"), 
                 b = c("ARG_1986_EPH_D2_INC_HIST", "ARG_1986_EPH_D2_INC_HIST"))


z1[,
   cache_id := as.list(apply(.SD, 1, unique)),
   .SDcols = c("a", "b")
]

z1[]
#>                             a                          b
#> 1: ARE_2014_HIES_D1_INC_GROUP ARE_2014_HIES_D1_INC_GROUP
#> 2: ARE_2014_HIES_D1_INC_GROUP ARE_2015_HIES_D1_INC_GROUP
#>                                                 cache_id
#> 1:                            ARE_2014_HIES_D1_INC_GROUP
#> 2: ARE_2014_HIES_D1_INC_GROUP,ARE_2015_HIES_D1_INC_GROUP


z2[,
   cache_id := as.list(apply(.SD, 1, unique)),
   .SDcols = c("a", "b")
]
#> Error in `[.data.table`(z2, , `:=`(cache_id, as.list(apply(.SD, 1, unique))), : Supplied 4 items to be assigned to 2 items of column 'cache_id'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
z2[]
#>                            a                        b
#> 1: ARG_1980_EPH_D2_INC_GROUP ARG_1986_EPH_D2_INC_HIST
#> 2: ARG_1980_EPH_D2_INC_GROUP ARG_1986_EPH_D2_INC_HIST

Created on 2023-06-12 with reprex v2.0.2

Answer 1

Score: 2

apply() returns a matrix when every result has the same length (greater than one), and a list when the lengths differ:

apply(z1[, .(a, b)], 1, unique)

[[1]]
[1] "ARE_2014_HIES_D1_INC_GROUP"

[[2]]
[1] "ARE_2014_HIES_D1_INC_GROUP" "ARE_2015_HIES_D1_INC_GROUP"

apply(z2[, .(a, b)], 1, unique)

     [,1]                        [,2]                       
[1,] "ARG_1980_EPH_D2_INC_GROUP" "ARG_1980_EPH_D2_INC_GROUP"
[2,] "ARG_1986_EPH_D2_INC_HIST"  "ARG_1986_EPH_D2_INC_HIST" 

Also, as.list() on a matrix does not give you a column-wise list; each element of the matrix becomes its own list element:

as.list(apply(z2[, .(a, b)], 1, unique))

[[1]]
[1] "ARG_1980_EPH_D2_INC_GROUP"

[[2]]
[1] "ARG_1986_EPH_D2_INC_HIST"

[[3]]
[1] "ARG_1980_EPH_D2_INC_GROUP"

[[4]]
[1] "ARG_1986_EPH_D2_INC_HIST"

Hence the error about the length: 4 items are supplied for a column with 2 rows.

I'm not exactly sure what your end result should be, so I can't provide a definite answer.
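One way to sidestep the simplification, sketched here under the assumption that R 4.1.0 or later is available, is the simplify = FALSE argument of apply(), which always returns one list element per row. The variable names simplified and per_row are only illustrative:

```r
library(data.table)

z2 <- data.table(
  a = c("ARG_1980_EPH_D2_INC_GROUP", "ARG_1980_EPH_D2_INC_GROUP"),
  b = c("ARG_1986_EPH_D2_INC_HIST", "ARG_1986_EPH_D2_INC_HIST")
)

# Default behaviour: every row yields 2 unique values, so apply()
# simplifies the results into a 2 x 2 character matrix.
simplified <- apply(z2[, .(a, b)], 1, unique)
is.matrix(simplified)

# simplify = FALSE (assumes R >= 4.1.0) keeps one list element per row,
# regardless of how many unique values each row has.
per_row <- apply(z2[, .(a, b)], 1, unique, simplify = FALSE)
length(per_row)

# A list with one element per row can be assigned as a list column.
z2[, cache_id := per_row]
z2[]
```

With this change the same expression should behave the same way for z1 and z2, because the shape of the result no longer depends on the data.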

Answer 2

Score: 1

You can try the following approach:

z1[, cache_id := list(.(unique(c(a, b)))), 1:nrow(z1)]

and similarly for z2.

Output:

                            a                          b                                              cache_id
                       <char>                     <char>                                                <list>
1: ARE_2014_HIES_D1_INC_GROUP ARE_2014_HIES_D1_INC_GROUP                            ARE_2014_HIES_D1_INC_GROUP
2: ARE_2014_HIES_D1_INC_GROUP ARE_2015_HIES_D1_INC_GROUP ARE_2014_HIES_D1_INC_GROUP,ARE_2015_HIES_D1_INC_GROUP

Answer 3

Score: 1

Another approach, which doesn't need to iterate over rows like @langtang's answer:

z1[, cache_id := lapply(.mapply(c, .SD, NULL), unique), .SDcols = c("a", "b")
   ][, cache_id := sapply(cache_id, paste, collapse = ", ")]
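To see what the .mapply() step does on its own, here is a small base R sketch on a plain named list with made-up values. The name cols is a hypothetical stand-in for .SD, not something from the answer:

```r
# Hypothetical stand-in for .SD: a named list of equal-length columns.
cols <- list(a = c("x", "x"), b = c("x", "y"))

# .mapply() calls c(a[i], b[i]) for each position i,
# which produces one vector per row.
pairs <- .mapply(c, cols, NULL)

# unique() then removes duplicates within each row's vector.
res <- lapply(pairs, unique)
str(res)
```

Note that the chained sapply(cache_id, paste, collapse = ", ") step in the answer turns the list column into a plain character column; it can be dropped if a list column is what you want.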

Answer 4

Score: 1

It is unclear why you are doing this, and I suspect we are dealing with an XY problem here. Anyway, you should almost never need to iterate over the rows of a data.table. Usually that's a design issue. If you really need to do it, then turn to Rcpp if it isn't a one-off or if your data.table is actually large.

Anyway, in this specific example, you can use data.table's unique() method:

library(data.table)
z1 <- data.table(a = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2014_HIES_D1_INC_GROUP"), 
                 b = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2015_HIES_D1_INC_GROUP"))
z1[, rn := .I]
unique(melt(z1, "rn"), by = c("rn", "value"))
#   rn variable                      value
#1:  1        a ARE_2014_HIES_D1_INC_GROUP
#2:  2        a ARE_2014_HIES_D1_INC_GROUP
#3:  2        b ARE_2015_HIES_D1_INC_GROUP

If you must, you can then split the value column by rn and add it to the data.table. But again, why would you need that?
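The last step is only described in words, so here is one possible sketch of it. It collapses the long table back to one list element per rn with a grouped aggregation rather than split(), which is a swapped-in technique; long and per_row are illustrative names:

```r
library(data.table)

z1 <- data.table(
  a = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2014_HIES_D1_INC_GROUP"),
  b = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2015_HIES_D1_INC_GROUP")
)
z1[, rn := .I]

# Long format, keeping one row per distinct (rn, value) pair.
long <- unique(melt(z1, id.vars = "rn"), by = c("rn", "value"))

# One list element per original row; keyby sorts by rn, so the result
# lines up with the row order of z1 (rn runs from 1 to nrow(z1)).
per_row <- long[, .(cache_id = list(value)), keyby = rn]

# Attach the list column and drop the helper row number.
z1[, cache_id := per_row$cache_id]
z1[, rn := NULL]
z1[]
```

This relies on rn being the row index of z1; if rows were filtered or reordered in between, a join on rn would be the safer way to attach the column.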

Answer 5

Score: 0

Thank you all for your answers. They really helped me out, and I learned more about apply and data.table. I selected @langtang's answer because it is the fastest. Still, thank you so much @hieu-nguyen for both solutions. I think simplify = FALSE was the key to the problem, but you made that point in a comment, which I can't select as the answer. Please find the benchmark below:

library(data.table)

n <- 1e4
x <- sapply(1:n, \(x) sample(letters, 10) |> paste(collapse = ""))
y <- sapply(1:n, \(x) sample(letters, 10) |> paste(collapse = ""))

ni <- sample(1:n, floor(n/10), replace = FALSE)

x[ni] <- y[ni]

z1 <- data.table(a = x, 
                 b = y)

bench <- microbenchmark::microbenchmark(
  times = 30,
  simplify = z1[,
                cache_id := as.list(apply(.SD, 1, unique, simplify = FALSE)),
                .SDcols = c("a", "b")],
  loop_rows = z1[, cache_id := list(.(unique(c(a, b)))), 1:nrow(z1)], 
  mapply    = z1[, cache_id := lapply(.mapply(c, .SD, NULL), unique), .SDcols = c("a", "b")]
)

bench
#> Unit: milliseconds
#>       expr       min        lq     mean   median       uq      max neval cld
#>   simplify 145.03549 171.67624 209.7165 214.4948 244.1717 268.3255    30 a  
#>  loop_rows  80.62317  98.74864 110.2403 106.6774 122.0016 148.4702    30  b 
#>     mapply 337.39212 409.21162 482.0041 478.9344 544.5397 765.9302    30   c

Created on 2023-06-13 with reprex v2.0.2

huangapple
  • Published on 2023-06-13 08:09:17
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76460963.html