Inconsistent output of data.table using apply(.SD, 1, FUN)
Question
Working with data.table in R, I am trying to combine two columns and create a new column that holds the unique values of each row. In the example below, the code works fine with table z1, but with table z2 I get an error. Both tables were created in the same way. The columns hold different values, but that should not be a reason for the same code to fail on z2.
Thank you so much for your help, and please let me know if I am not clear.
Best,
library(data.table)
z1 <- data.table(a = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2014_HIES_D1_INC_GROUP"),
                 b = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2015_HIES_D1_INC_GROUP"))
z2 <- data.table(a = c("ARG_1980_EPH_D2_INC_GROUP", "ARG_1980_EPH_D2_INC_GROUP"),
                 b = c("ARG_1986_EPH_D2_INC_HIST", "ARG_1986_EPH_D2_INC_HIST"))
z1[,
   cache_id := as.list(apply(.SD, 1, unique)),
   .SDcols = c("a", "b")
]
z1[]
#> a b
#> 1: ARE_2014_HIES_D1_INC_GROUP ARE_2014_HIES_D1_INC_GROUP
#> 2: ARE_2014_HIES_D1_INC_GROUP ARE_2015_HIES_D1_INC_GROUP
#> cache_id
#> 1: ARE_2014_HIES_D1_INC_GROUP
#> 2: ARE_2014_HIES_D1_INC_GROUP,ARE_2015_HIES_D1_INC_GROUP
z2[,
   cache_id := as.list(apply(.SD, 1, unique)),
   .SDcols = c("a", "b")
]
#> Error in `[.data.table`(z2, , `:=`(cache_id, as.list(apply(.SD, 1, unique))), : Supplied 4 items to be assigned to 2 items of column 'cache_id'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
z2[]
#> a b
#> 1: ARG_1980_EPH_D2_INC_GROUP ARG_1986_EPH_D2_INC_HIST
#> 2: ARG_1980_EPH_D2_INC_GROUP ARG_1986_EPH_D2_INC_HIST
<sup>Created on 2023-06-12 with reprex v2.0.2</sup>
Answer 1
Score: 2
apply returns a matrix if every result has the same length, otherwise a list:
apply(z1[, .(a, b)], 1, unique)
[[1]]
[1] "ARE_2014_HIES_D1_INC_GROUP"
[[2]]
[1] "ARE_2014_HIES_D1_INC_GROUP" "ARE_2015_HIES_D1_INC_GROUP"
apply(z2[, .(a, b)], 1, unique)
[,1] [,2]
[1,] "ARG_1980_EPH_D2_INC_GROUP" "ARG_1980_EPH_D2_INC_GROUP"
[2,] "ARG_1986_EPH_D2_INC_HIST" "ARG_1986_EPH_D2_INC_HIST"
Also, as.list on a matrix does not give you a column-wise list; each element of the matrix becomes its own list element:
as.list(apply(z2[, .(a, b)], 1, unique))
[[1]]
[1] "ARG_1980_EPH_D2_INC_GROUP"
[[2]]
[1] "ARG_1986_EPH_D2_INC_HIST"
[[3]]
[1] "ARG_1980_EPH_D2_INC_GROUP"
[[4]]
[1] "ARG_1986_EPH_D2_INC_HIST"
Hence the length error ("Supplied 4 items to be assigned to 2 items").
I'm not exactly sure what your end result should be, so I can't provide a definite answer.
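The behaviour described above can be reproduced with plain matrices, without data.table. The sketch below is only an illustration: the matrices m_same and m_diff are made-up stand-ins for z2 and z1, and the simplify argument of apply() is assumed to be available (it was added in R 4.1.0).

```r
# Hypothetical stand-ins: in m_same every row has two distinct values (like z2),
# in m_diff the first row has only one distinct value (like z1).
m_same <- cbind(a = c("x", "x"), b = c("y", "y"))
m_diff <- cbind(a = c("x", "x"), b = c("x", "y"))

simplified <- apply(m_same, 1, unique)  # equal lengths: simplified to a matrix
ragged     <- apply(m_diff, 1, unique)  # unequal lengths: stays a list

# as.list() on the matrix yields one element per cell, not one per row
flattened <- as.list(simplified)

# simplify = FALSE keeps one list element per row in both cases
per_row <- apply(m_same, 1, unique, simplify = FALSE)

is.matrix(simplified)
is.list(ragged)
length(flattened)
length(per_row)
```

With simplify = FALSE the result has as many elements as the input has rows, which is what assigning a list column with := needs.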
Answer 2
Score: 1
You can try the following approach:
z1[, cache_id := list(.(unique(c(a, b)))), by = 1:nrow(z1)]
and similarly for z2.
Output:
a b cache_id
<char> <char> <list>
1: ARE_2014_HIES_D1_INC_GROUP ARE_2014_HIES_D1_INC_GROUP ARE_2014_HIES_D1_INC_GROUP
2: ARE_2014_HIES_D1_INC_GROUP ARE_2015_HIES_D1_INC_GROUP ARE_2014_HIES_D1_INC_GROUP,ARE_2015_HIES_D1_INC_GROUP
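For completeness, here is a sketch of the same per-row grouping applied to z2, the table that failed in the question. It spells the inner `.()` as `list()` and uses seq_len() instead of `1:nrow()`; both are stylistic assumptions on my part, not part of the answer above.

```r
library(data.table)

z2 <- data.table(
  a = c("ARG_1980_EPH_D2_INC_GROUP", "ARG_1980_EPH_D2_INC_GROUP"),
  b = c("ARG_1986_EPH_D2_INC_HIST", "ARG_1986_EPH_D2_INC_HIST")
)

# One group per row; the inner list() wraps the unique values so that
# each row receives a single list element, whatever its length.
z2[, cache_id := list(list(unique(c(a, b)))), by = seq_len(nrow(z2))]

# Number of unique values stored in each row
lengths(z2$cache_id)
```

Because each group produces exactly one list element, the result always has one entry per row, so the length mismatch from the question cannot occur.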
Answer 3
Score: 1
Another approach, which doesn't need to iterate over rows the way @langtang's answer does:
z1[, cache_id := lapply(.mapply(c, .SD, NULL), unique), .SDcols = c("a", "b")
][, cache_id := sapply(cache_id, paste, collapse = ", ")]
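To see what the two steps do, here is a small base R sketch on a made-up list of columns (cols is a hypothetical stand-in for .SD). Note that the second step replaces the list of vectors with a plain character vector, so the final column is a collapsed string rather than a list column.

```r
# Hypothetical stand-in for .SD: a list of equal-length columns
cols <- list(a = c("x", "x"), b = c("x", "y"))

# .mapply() calls c() once per row position, pairing up the i-th elements
rows <- .mapply(c, cols, NULL)

# Drop duplicates within each row
uniq <- lapply(rows, unique)

# Collapse each row's unique values into a single string
collapsed <- vapply(uniq, paste, character(1), collapse = ", ")
collapsed
```

If a list column is what you need, stopping after the lapply() step keeps the values as separate elements.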
Answer 4
Score: 1
It is unclear why you are doing this, and I suspect we are dealing with an XY problem here. Anyway, you should almost never need to iterate over the rows of a data.table. Usually that's a design issue. If you really need to do it, then turn to Rcpp if it isn't a one-off or if your data.table is actually large.
Anyway, in this specific example, you can use data.table's unique method:
library(data.table)
z1 <- data.table(a = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2014_HIES_D1_INC_GROUP"),
b = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2015_HIES_D1_INC_GROUP"))
z1[, rn := .I]
unique(melt(z1, "rn"), by = c("rn", "value"))
# rn variable value
#1: 1 a ARE_2014_HIES_D1_INC_GROUP
#2: 2 a ARE_2014_HIES_D1_INC_GROUP
#3: 2 b ARE_2015_HIES_D1_INC_GROUP
If you must, you can then split the value column by rn and add it to the data.table. But again, why would you need that?
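The split step mentioned above could look roughly like this. It is a sketch under the assumption that every row number appears at least once in the melted table (true here, since each row has at least one value); the name long is hypothetical.

```r
library(data.table)

z1 <- data.table(
  a = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2014_HIES_D1_INC_GROUP"),
  b = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2015_HIES_D1_INC_GROUP")
)
z1[, rn := .I]

# Long format, one row per (row number, distinct value)
long <- unique(melt(z1, id.vars = "rn"), by = c("rn", "value"))

# split() groups the values by row number; wrapping the result in list()
# tells data.table to store it as a single list column
z1[, cache_id := list(unname(split(long$value, long$rn)))]

lengths(z1$cache_id)
```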
Answer 5
Score: 0
Thank you all for your answers. They really helped me out, and I learned more about apply and data.table. I selected @langtang's answer because it is the fastest. Still, thank you so much, @hieu-nguyen, for both solutions. I think simplify = FALSE was the key to the problem, but you made that point in a comment, which I can't select as the answer. Please find the benchmark below:
library(data.table)
n <- 1e4
x <- sapply(1:n, \(x) sample(letters, 10) |> paste(collapse = ""))
y <- sapply(1:n, \(x) sample(letters, 10) |> paste(collapse = ""))
ni <- sample(1:n, floor(n/10), replace = FALSE)
x[ni] <- y[ni]
z1 <- data.table(a = x,
b = y)
bench <- microbenchmark::microbenchmark(
times = 30,
simplify = z1[,
cache_id := as.list(apply(.SD, 1, unique, simplify = FALSE)),
.SDcols = c("a", "b")],
loop_rows = z1[, cache_id:=list(.(unique(c(a,b)))), 1:nrow(z1)],
mapply = z1[, cache_id := lapply(.mapply(c, .SD, NULL), unique), .SDcols = c("a", "b")]
)
bench
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> simplify 145.03549 171.67624 209.7165 214.4948 244.1717 268.3255 30 a
#> loop_rows 80.62317 98.74864 110.2403 106.6774 122.0016 148.4702 30 b
#> mapply 337.39212 409.21162 482.0041 478.9344 544.5397 765.9302 30 c
<sup>Created on 2023-06-13 with reprex v2.0.2</sup>
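Before reading too much into the timings, it may be worth confirming that the three expressions produce the same values. The sketch below does that on a tiny made-up table; dt, the via_* column names, and the flatten helper are all hypothetical, and simplify = FALSE assumes R 4.1.0 or later.

```r
library(data.table)

# Small hypothetical table: one row with duplicates, two without
dt <- data.table(a = c("x", "x", "p"), b = c("x", "y", "q"))
cols <- c("a", "b")

# The three approaches from the benchmark, each written to its own column
dt[, via_apply := apply(.SD, 1, unique, simplify = FALSE), .SDcols = cols]
dt[, via_rows := list(list(unique(c(a, b)))), by = seq_len(nrow(dt))]
dt[, via_mapply := lapply(.mapply(c, .SD, NULL), unique), .SDcols = cols]

# Collapse each list element to a string so the columns are easy to compare
flatten <- function(col) vapply(col, paste, character(1), collapse = "|")

same <- identical(flatten(dt$via_apply), flatten(dt$via_rows)) &&
  identical(flatten(dt$via_rows), flatten(dt$via_mapply))
same
```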