高效地匹配一个向量中的所有值与另一个向量中的值。

huangapple go评论96阅读模式
英文:

Efficiently match all values of a vector in another vector

问题

我正在寻找一种高效的方法,可以匹配向量x中的所有值,而不仅仅是第一个位置,就像match()返回的那样。实际上,我想要的是pmatch()的默认行为,但不进行部分匹配:

  1. x <- c(3L, 1L, 2L, 3L, 3L, 2L)
  2. y <- c(3L, 3L, 3L, 3L, 1L, 3L)

期望的输出:

  1. pmatch(x, y)
  2. [1] 1 5 NA 2 3 NA

一种方法是使用ave(),但随着组数的增加,这会变得很慢,并且非常占用内存:

  1. ave(x, x, FUN = \(v) which(y == v[1])[1:length(v)])
  2. [1] 1 5 NA 2 3 NA

是否有人能推荐一种在最好的情况下(但不是必须的)使用基本R来实现这一目标的高效方法?

用于基准测试的较大数据集:

  1. set.seed(5)
  2. x <- sample(5e3, 1e5, replace = TRUE)
  3. y <- sample(x, replace = TRUE)
英文:

I'm looking to find an efficient method of matching all values of vector x in vector y rather than just the first position, as is returned by match(). What I'm after essentially is the default behavior of pmatch() but without partial matching:

  1. x &lt;- c(3L, 1L, 2L, 3L, 3L, 2L)
  2. y &lt;- c(3L, 3L, 3L, 3L, 1L, 3L)

Expected output:

  1. pmatch(x, y)
  2. [1] 1 5 NA 2 3 NA

One way is to use ave() however this becomes slow and very memory inefficient as the number of groups increases:

  1. ave(x, x, FUN = \(v) which(y == v[1])[1:length(v)])
  2. [1] 1 5 NA 2 3 NA

Can anyone recommend an efficient way to achieve this in preferably (but not mandatory) base R?

Larger dataset for benchmarking:

  1. set.seed(5)
  2. x &lt;- sample(5e3, 1e5, replace = TRUE)
  3. y &lt;- sample(x, replace = TRUE)

答案1

得分: 9

在这个代码部分中,你尝试了不同的方法来处理两个向量的匹配和索引问题。以下是你提到的几个部分的翻译:

  1. base 中使用 split 进行变体。
  1. x <- c(3L, 1L, 2L, 3L, 3L, 2L)
  2. y <- c(3L, 3L, 3L, 3L, 1L, 3L)
  3. a <- split(seq_along(x), x)
  4. b <- split(seq_along(y), y)[names(a)]
  5. b[lengths(b) == 0] <- NA
  6. b <- unlist(Map(`length<-`, b, lengths(a)), FALSE, FALSE)
  7. `[<-`(b, unlist(a, FALSE, FALSE), b)
  8. #[1] 1 5 NA 2 3 NA
  1. 你尝试用 RCPP 版本来优化匹配。
  1. Rcpp::sourceCpp(code="
  2. #include <Rcpp.h>
  3. #include <unordered_map>
  4. #include <queue>
  5. using namespace Rcpp;
  6. // [[Rcpp::export]]
  7. IntegerVector pm(const std::vector<int>& a, const std::vector<int>& b) {
  8. IntegerVector idx(no_init(a.size()));
  9. std::unordered_map<int, std::queue<int> > lut;
  10. for(int i = 0; i < b.size(); ++i) lut[b[i]].push(i);
  11. for(int i = 0; i < idx.size(); ++i) {
  12. auto search = lut.find(a[i]);
  13. if(search != lut.end() && search->second.size() > 0) {
  14. idx[i] = search->second.front() + 1;
  15. search->second.pop();
  16. } else {idx[i] = NA_INTEGER;}
  17. }
  18. return idx;
  19. }
  20. )")
  21. pm(x, y)
  22. #[1] 1 5 NA 2 3 NA
  1. 优化的 RCPP 版本。
  1. Rcpp::sourceCpp(code="
  2. #include <Rcpp.h>
  3. #include <vector>
  4. #include <array>
  5. #include <queue>
  6. #include <algorithm>
  7. using namespace Rcpp;
  8. // [[Rcpp::export]]
  9. IntegerVector pm2(const std::vector<int>& a, const std::vector<int>& b) {
  10. IntegerVector idx(no_init(a.size()));
  11. int max = 1 + *std::max_element(a.begin(), a.end());
  12. std::vector<int> n(max);
  13. for(int i = 0; i < a.size(); ++i) ++n[a[i]];
  14. std::vector<std::queue<int> > lut(max);
  15. for(int i = 0; i < b.size(); ++i) {
  16. if(b[i] < max && n[b[i]] > 0) {
  17. --n[b[i]];
  18. lut[b[i]].push(i);
  19. }
  20. }
  21. for(int i = 0; i < idx.size(); ++i) {
  22. auto & P = lut[a[i]];
  23. if(P.size() > 0) {
  24. idx[i] = P.front() + 1;
  25. P.pop();
  26. } else {idx[i] = NA_INTEGER;}
  27. }
  28. return idx;
  29. }
  30. )")
  31. pm2(x, y)
  32. #[1] 1 5 NA 2 3 NA

请注意,这些代码段都是用于优化匹配的不同方法,包括使用了 baseRCPP 和优化的 RCPP 版本。如果你需要更多的翻译或解释,请随时提出。

英文:

A variant in base using split.
split the indices of both vectors by its value. Subset the second list with the names of the first, that both have the same order. Change NULL to NA and bring the lengths of the second list to those from the first. Reorder the indices of the second list by those of the first.

  1. x &lt;- c(3L, 1L, 2L, 3L, 3L, 2L)
  2. y &lt;- c(3L, 3L, 3L, 3L, 1L, 3L)
  3. a &lt;- split(seq_along(x), x)
  4. b &lt;- split(seq_along(y), y)[names(a)]
  5. b[lengths(b)==0] &lt;- NA
  6. b &lt;- unlist(Map(`length&lt;-`, b, lengths(a)), FALSE, FALSE)
  7. `[&lt;-`(b, unlist(a, FALSE, FALSE), b)
  8. #[1] 1 5 NA 2 3 NA

I tried to exchange the part

  1. b &lt;- split(seq_along(y), y)[names(a)]
  2. b[lengths(b)==0] &lt;- NA

with

  1. b &lt;- list2env(split(seq_along(y), y))
  2. b &lt;- mget(names(a), b, ifnotfound = NA)

But it was not faster.

An RCPP version.
Store the indices of the second vector ín a queue for each unique value in an unordered_map. Iterate over all values of the first vector and take the indices from the queue.

  1. Rcpp::sourceCpp(code=r&quot;(
  2. #include &lt;Rcpp.h&gt;
  3. #include &lt;unordered_map&gt;
  4. #include &lt;queue&gt;
  5. using namespace Rcpp;
  6. // [[Rcpp::export]]
  7. IntegerVector pm(const std::vector&lt;int&gt;&amp; a, const std::vector&lt;int&gt;&amp; b) {
  8. IntegerVector idx(no_init(a.size()));
  9. std::unordered_map&lt;int, std::queue&lt;int&gt; &gt; lut;
  10. for(int i = 0; i &lt; b.size(); ++i) lut[b[i]].push(i);
  11. for(int i = 0; i &lt; idx.size(); ++i) {
  12. auto search = lut.find(a[i]);
  13. if(search != lut.end() &amp;&amp; search-&gt;second.size() &gt; 0) {
  14. idx[i] = search-&gt;second.front() + 1;
  15. search-&gt;second.pop();
  16. } else {idx[i] = NA_INTEGER;}
  17. }
  18. return idx;
  19. }
  20. )&quot;)
  21. pm(x, y)
  22. #[1] 1 5 NA 2 3 NA

A for this case specialized RCPP version.
Create a vector of the length of the maximum value of the first vector and count how many times a value is present. Create another queue vector of the same length and sore there the indices of the values of the second vector until it has reached the number of the first. Iterate over all values of the first vector and take the indices from the queue.

  1. Rcpp::sourceCpp(code=r&quot;(
  2. #include &lt;Rcpp.h&gt;
  3. #include &lt;vector&gt;
  4. #include &lt;array&gt;
  5. #include &lt;queue&gt;
  6. #include &lt;algorithm&gt;
  7. using namespace Rcpp;
  8. // [[Rcpp::export]]
  9. IntegerVector pm2(const std::vector&lt;int&gt;&amp; a, const std::vector&lt;int&gt;&amp; b) {
  10. IntegerVector idx(no_init(a.size()));
  11. int max = 1 + *std::max_element(a.begin(), a.end());
  12. std::vector&lt;int&gt; n(max);
  13. for(int i = 0; i &lt; a.size(); ++i) ++n[a[i]];
  14. std::vector&lt;std::queue&lt;int&gt; &gt; lut(max);
  15. for(int i = 0; i &lt; b.size(); ++i) {
  16. if(b[i] &lt; max &amp;&amp; n[b[i]] &gt; 0) {
  17. --n[b[i]];
  18. lut[b[i]].push(i);
  19. }
  20. }
  21. for(int i = 0; i &lt; idx.size(); ++i) {
  22. auto &amp; P = lut[a[i]];
  23. if(P.size() &gt; 0) {
  24. idx[i] = P.front() + 1;
  25. P.pop();
  26. } else {idx[i] = NA_INTEGER;}
  27. }
  28. return idx;
  29. }
  30. )&quot;)
  31. pm2(x,y)
  32. #[1] 1 5 NA 2 3 NA

Benchmark

  1. set.seed(5)
  2. x &lt;- sample(5e3, 1e5, replace = TRUE)
  3. y &lt;- sample(x, replace = TRUE)
  4. library(data.table)
  5. matchall &lt;- function(x, y) {
  6. data.table(y, rowid(y))[
  7. data.table(x, rowid(x)), on = .(y = x, V2), which = TRUE
  8. ]
  9. }
  10. rmatch &lt;- function(x, y) {
  11. xp &lt;- cbind(seq_along(x), x)[order(x),]
  12. yp &lt;- cbind(seq_along(y), y)[order(y),]
  13. result &lt;- numeric(length(x))
  14. xi &lt;- yi &lt;- 1
  15. Nx &lt;- length(x)
  16. Ny &lt;- length(y)
  17. while (xi &lt;= Nx) {
  18. if (yi &gt; Ny) {
  19. result[xp[xi,1]] &lt;- NA
  20. xi &lt;- xi + 1
  21. } else if (xp[xi,2] == yp[yi,2]) {
  22. result[xp[xi,1]] = yp[yi,1]
  23. xi &lt;- xi + 1
  24. yi &lt;- yi + 1
  25. } else if (xp[xi,2] &lt; yp[yi,2]) {
  26. result[xp[xi,1]] &lt;- NA
  27. xi &lt;- xi + 1
  28. } else if (xp[xi,2] &gt; yp[yi,2]) {
  29. yi &lt;- yi + 1
  30. }
  31. }
  32. result
  33. }
  34. bench::mark(
  35. ave = ave(x, x, FUN = \(v) which(y == v[1])[1:length(v)]),
  36. rmatch = rmatch(x, y),
  37. make.name = match(make.names(x, TRUE), make.names(y, TRUE)),
  38. paste = do.call(match, lapply(list(x, y), \(v) paste(v, ave(v, v, FUN = seq_along)))),
  39. make.unique = match(make.unique(as.character(x)), make.unique(as.character(y))),
  40. split = {a &lt;- split(seq_along(x), x)
  41. b &lt;- split(seq_along(y), y)[names(a)]
  42. b[lengths(b)==0] &lt;- NA
  43. b &lt;- unlist(Map(`length&lt;-`, b, lengths(a)), FALSE, FALSE)
  44. `[&lt;-`(b, unlist(a, FALSE, FALSE), b)},
  45. data.table = matchall(x, y),
  46. RCPP = pm(x, y),
  47. RCPP2 = pm2(x, y)
  48. )

Result

  1. expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
  2. &lt;bch:expr&gt; &lt;bch:tm&gt; &lt;bch:tm&gt; &lt;dbl&gt; &lt;bch:byt&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt;
  3. 1 ave 1.66s 1.66s 0.603 3.73GB 68.7 1 114
  4. 2 rmatch 258.29ms 259.35ms 3.86 5.34MB 30.8 2 16
  5. 3 make.name 155.69ms 156.82ms 6.37 14.06MB 1.59 4 1
  6. 4 paste 93.8ms 102.06ms 9.74 18.13MB 7.79 5 4
  7. 5 make.unique 81.67ms 92.8ms 10.4 9.49MB 5.22 6 3
  8. 6 split 12.66ms 13.16ms 65.8 7.18MB 16.0 33 8
  9. 7 data.table 6.22ms 6.89ms 114. 5.13MB 28.0 57 14
  10. 8 RCPP 3.06ms 3.2ms 301. 393.16KB 3.98 151 2
  11. 9 RCPP2 1.64ms 1.82ms 514. 393.16KB 8.00 257 4

In this case the C++ version is the fastest and allocates the lowest amount of memory. In case using base the splitB variant is the fastest and rmatch allocates the lowest amount of memory.

答案2

得分: 7

只是指出,您可以使用 match + make.unique 来实现相同的功能。从速度上来说,它可能比 data.table 方法慢:

  1. match(make.unique(as.character(x)), make.unique(as.character(y)))
  2. [1] 1 5 NA 2 3 NA

  1. match(make.names(x, TRUE), make.names(y, TRUE))
  2. [1] 1 5 NA 2 3 NA
英文:

Just to point out, you can use match + make.unique to accomplish the same. Speedwise, it might be slower than the data.table approach:

  1. match(make.unique(as.character(x)), make.unique(as.character(y)))
  2. [1] 1 5 NA 2 3 NA

  1. match(make.names(x, TRUE), make.names(y, TRUE))
  2. [1] 1 5 NA 2 3 NA

答案3

得分: 6

使用data.table的连接方法,受到问题的启发。

  1. library(data.table)
  2. matchall <- function(x, y) {
  3. data.table(y, rowid(y))[
  4. data.table(x, rowid(x)), on = .(y = x, V2), which = TRUE
  5. ]
  6. }

检查行为

  1. x <- c(3L, 1L, 2L, 3L, 3L, 2L)
  2. y <- c(3L, 3L, 3L, 3L, 1L, 3L)
  3. matchall(x, y)
  4. #> [1] 1 5 NA 2 3 NA

在更大的向量上进行时间测试:

  1. set.seed(5)
  2. x <- sample(5e3, 1e5, replace = TRUE)
  3. y <- sample(x, replace = TRUE)
  4. system.time(z1 <- matchall(x, y))
  5. #> user system elapsed
  6. #> 0.06 0.00 0.01
  7. system.time(z2 <- ave(x, x, FUN = \(v) which(y == v[1])[1:length(v)]))
  8. #> user system elapsed
  9. #> 0.88 0.43 1.31
  10. identical(z1, z2)
  11. #> [1] TRUE
英文:

Using a data.table join, inspired by this Q&A.

  1. library(data.table)
  2. matchall &lt;- function(x, y) {
  3. data.table(y, rowid(y))[
  4. data.table(x, rowid(x)), on = .(y = x, V2), which = TRUE
  5. ]
  6. }

Check behavior

  1. x &lt;- c(3L, 1L, 2L, 3L, 3L, 2L)
  2. y &lt;- c(3L, 3L, 3L, 3L, 1L, 3L)
  3. matchall(x, y)
  4. #&gt; [1] 1 5 NA 2 3 NA

Timing on larger vectors:

  1. set.seed(5)
  2. x &lt;- sample(5e3, 1e5, replace = TRUE)
  3. y &lt;- sample(x, replace = TRUE)
  4. system.time(z1 &lt;- matchall(x, y))
  5. #&gt; user system elapsed
  6. #&gt; 0.06 0.00 0.01
  7. system.time(z2 &lt;- ave(x, x, FUN = \(v) which(y == v[1])[1:length(v)]))
  8. #&gt; user system elapsed
  9. #&gt; 0.88 0.43 1.31
  10. identical(z1, z2)
  11. #&gt; [1] TRUE

答案4

得分: 4

如果您有额外的内存可用,可以通过对值进行排序并基本上进行两个指针的遍历来加速该过程,以匹配数据。以下是示例代码:

  1. rmatch <- function(x, y) {
  2. xp <- cbind(seq_along(x), x)[order(x),]
  3. yp <- cbind(seq_along(y), y)[order(y),]
  4. result <- numeric(length(x))
  5. xi <- yi <- 1
  6. Nx <- length(x)
  7. Ny <- length(y)
  8. while (xi <= Nx) {
  9. if (yi > Ny) {
  10. result[xp[xi,1]] <- NA
  11. xi <- xi + 1
  12. } else if (xp[xi,2] == yp[yi,2]) {
  13. result[xp[xi,1]] = yp[yi,1]
  14. xi <- xi + 1
  15. yi <- yi + 1
  16. } else if (xp[xi,2] < yp[yi,2]) {
  17. result[xp[xi,1]] <- NA
  18. xi <- xi + 1
  19. } else if (xp[xi,2] > yp[yi,2]) {
  20. yi <- yi + 1
  21. }
  22. }
  23. result
  24. }

我还进行了一些基于R的选项的测试,如下所示:

  1. mbm <- microbenchmark::microbenchmark(
  2. ave = ave(x, x, FUN = \(v) which(y == v[1])[1:length(v)]),
  3. rmatch = rmatch(x, y),
  4. pmatch = pmatch(x, y),
  5. times = 20
  6. )

并且看到性能似乎表现良好:

  1. Unit: milliseconds
  2. expr min lq mean median uq max neval
  3. ave 1227.6743 1247.6980 1283.1024 1264.1485 1324.1569 1349.3276 20
  4. rmatch 198.1744 201.1058 208.3158 204.5933 209.4863 247.7279 20
  5. pmatch 39514.4227 39595.9720 39717.5887 39628.0892 39805.2405 40105.4337 20

这些都返回相同的值向量。

英文:

If you have some extra memory to spare, you can speed up the process by sorting the values and basically doing a two-pointer walk through to match up the data. Here's what what would look like

  1. rmatch &lt;- function(x, y) {
  2. xp &lt;- cbind(seq_along(x), x)[order(x),]
  3. yp &lt;- cbind(seq_along(y), y)[order(y),]
  4. result &lt;- numeric(length(x))
  5. xi &lt;- yi &lt;- 1
  6. Nx &lt;- length(x)
  7. Ny &lt;- length(y)
  8. while (xi &lt;= Nx) {
  9. if (yi &gt; Ny) {
  10. result[xp[xi,1]] &lt;- NA
  11. xi &lt;- xi + 1
  12. } else if (xp[xi,2] == yp[yi,2]) {
  13. result[xp[xi,1]] = yp[yi,1]
  14. xi &lt;- xi + 1
  15. yi &lt;- yi + 1
  16. } else if (xp[xi,2] &lt; yp[yi,2]) {
  17. result[xp[xi,1]] &lt;- NA
  18. xi &lt;- xi + 1
  19. } else if (xp[xi,2] &gt; yp[yi,2]) {
  20. yi &lt;- yi + 1
  21. }
  22. }
  23. result
  24. }

I tested with some of the other base R options posted here

  1. mbm &lt;- microbenchmark::microbenchmark(
  2. ave = ave(x, x, FUN = \(v) which(y == v[1])[1:length(v)]),
  3. rmatch = rmatch(x, y),
  4. pmatch = pmatch(x, y),
  5. times = 20
  6. )

And saw that it seemed to perform well

  1. Unit: milliseconds
  2. expr min lq mean median uq max neval
  3. ave 1227.6743 1247.6980 1283.1024 1264.1485 1324.1569 1349.3276 20
  4. rmatch 198.1744 201.1058 208.3158 204.5933 209.4863 247.7279 20
  5. pmatch 39514.4227 39595.9720 39717.5887 39628.0892 39805.2405 40105.4337 20

These all return the same vector of values.

答案5

得分: 2

你可以简单地运行 match + paste + ave

  1. &gt; do.call(match, lapply(list(x, y), \(v) paste(v, ave(v, v, FUN = seq_along))))
  2. [1] 1 5 NA 2 3 NA
英文:

You can simply run match + paste + ave

  1. &gt; do.call(match, lapply(list(x, y), \(v) paste(v, ave(v, v, FUN = seq_along))))
  2. [1] 1 5 NA 2 3 NA

huangapple
  • 本文由 发表于 2023年6月1日 22:04:45
  • 转载请务必保留本文链接:https://go.coder-hub.com/76382756.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定