有效的代码来删除包含非唯一最大值的行?

huangapple go评论102阅读模式
英文:

Efficient code to remove rows containing non-unique max?

问题

这是一个关于提取具有唯一最大值的行的简单数组的示例。以下是有效的代码:

  1. winners <- max.col(foo)
  2. unique_max <- apply(foo, 1, function(row) length(which(row == max(row))) == 1)
  3. foo <- foo[unique_max, ]

这段代码将提取具有唯一最大值的行并将它们存储在foo中。这段代码更有效率并只使用了基本的R调用。

英文:

Here's a simple example of an array for which I want to extract only those rows whose max value is unique (in that row).

  1. foo &lt;- expand.grid(1:3,1:3,1:3)
  2. Var1 Var2 Var3
  3. 1 1 1 1
  4. 2 2 1 1
  5. 3 3 1 1
  6. 4 1 2 1
  7. 5 2 2 1
  8. 6 3 2 1
  9. 7 1 3 1
  10. 8 2 3 1
  11. 9 3 3 1
  12. 10 1 1 2
  13. 11 2 1 2
  14. 12 3 1 2
  15. 13 1 2 2
  16. 14 2 2 2
  17. 15 3 2 2
  18. 16 1 3 2
  19. 17 2 3 2
  20. 18 3 3 2
  21. 19 1 1 3
  22. 20 2 1 3
  23. 21 3 1 3
  24. 22 1 2 3
  25. 23 2 2 3
  26. 24 3 2 3
  27. 25 1 3 3
  28. 26 2 3 3
  29. 27 3 3 3

I've got working code:

  1. winners &lt;- max.col(foo)
  2. finddupe &lt;- rep(0,length=length(winners))
  3. for (jf in 1:length(winners)) finddupe[jf] &lt;- sum(foo[jf,] == foo[jf, winners[jf] ] )
  4. winners &lt;- winners[finddupe == 1]
  5. foo &lt;- foo[finddupe == 1, ]

That just looks inefficient to me.
I'd prefer a solution which only uses base - R calls, but am open to using tools in other libraries.

答案1

得分: 4

以下是代码的翻译部分:

Another base R solution:

  1. subset(foo, max.col(foo, 'first') == max.col(foo, 'last'))

Same logic as above in dplyr way:

  1. library(dplyr)
  2. foo %>%
  3. filter(max.col(., 'first') == max.col(., 'last'))
英文:

Another base R solution:

  1. subset(foo, max.col(foo, &#39;first&#39;) == max.col(foo, &#39;last&#39;))
  2. Var1 Var2 Var3
  3. 2 2 1 1
  4. 3 3 1 1
  5. 4 1 2 1
  6. 6 3 2 1
  7. 7 1 3 1
  8. 8 2 3 1
  9. 10 1 1 2
  10. 12 3 1 2
  11. 15 3 2 2
  12. 16 1 3 2
  13. 17 2 3 2
  14. 19 1 1 3
  15. 20 2 1 3
  16. 22 1 2 3
  17. 23 2 2 3
  18. &gt;

Same logic as above in dplyr way:

  1. library(dplyr)
  2. foo %&gt;%
  3. filter(max.col(., &#39;first&#39;) == max.col(., &#39;last&#39;))

答案2

得分: 3

使用pmax从所有列创建一个最大值的列,然后使用rowSums在逻辑数据集上筛选只包含单个唯一最大值的行。

  1. library(dplyr)
  2. foo %>%
  3. mutate(mx = do.call(pmax, c(across(everything()), na.rm = TRUE))) %>%
  4. filter(rowSums(across(Var1:Var3, ~ .x == mx), na.rm = TRUE) == 1)

输出:

  1. Var1 Var2 Var3 mx
  2. 1 2 1 1 2
  3. 2 3 1 1 3
  4. 3 1 2 1 2
  5. 4 3 2 1 3
  6. 5 1 3 1 3
  7. 6 2 3 1 3
  8. 7 1 1 2 2
  9. 8 3 1 2 3
  10. 9 3 2 2 3
  11. 10 1 3 2 3
  12. 11 2 3 2 3
  13. 12 1 1 3 3
  14. 13 2 1 3 3
  15. 14 1 2 3 3
  16. 15 2 2 3 3

或者使用base R

  1. subset(foo, rowSums(foo == do.call(pmax, c(foo, na.rm = TRUE)), na.rm = TRUE) == 1)
英文:

Create a column of max with pmax from all the columns, then filter the rows where there is only a single unique max by getting the count on a logical dataset with rowSums

  1. library(dplyr)
  2. foo %&gt;%
  3. mutate(mx = do.call(pmax, c(across(everything()), na.rm = TRUE))) %&gt;%
  4. filter(rowSums(across(Var1:Var3, ~ .x == mx), na.rm = TRUE) == 1)

-output

  1. Var1 Var2 Var3 mx
  2. 1 2 1 1 2
  3. 2 3 1 1 3
  4. 3 1 2 1 2
  5. 4 3 2 1 3
  6. 5 1 3 1 3
  7. 6 2 3 1 3
  8. 7 1 1 2 2
  9. 8 3 1 2 3
  10. 9 3 2 2 3
  11. 10 1 3 2 3
  12. 11 2 3 2 3
  13. 12 1 1 3 3
  14. 13 2 1 3 3
  15. 14 1 2 3 3
  16. 15 2 2 3 3

Or with base R

  1. subset(foo, rowSums(foo == do.call(pmax, c(foo, na.rm = TRUE)),
  2. na.rm = TRUE) == 1)

答案3

得分: 0

A base R approach using apply:

  1. foo[apply(foo, 1, function(x) sum(x[which.max(x)] == x) <= 1), ]
  2. Var1 Var2 Var3
  3. 2 2 1 1
  4. 3 3 1 1
  5. 4 1 2 1
  6. 6 3 2 1
  7. 7 1 3 1
  8. 8 2 3 1
  9. 10 1 1 2
  10. 12 3 1 2
  11. 15 3 2 2
  12. 16 1 3 2
  13. 17 2 3 2
  14. 19 1 1 3
  15. 20 2 1 3
  16. 22 1 2 3
  17. 23 2 2 3

Please note that the code section remains unchanged.

英文:

A base R approach using apply

  1. foo[apply(foo, 1, function(x) sum(x[which.max(x)] == x) &lt;= 1), ]
  2. Var1 Var2 Var3
  3. 2 2 1 1
  4. 3 3 1 1
  5. 4 1 2 1
  6. 6 3 2 1
  7. 7 1 3 1
  8. 8 2 3 1
  9. 10 1 1 2
  10. 12 3 1 2
  11. 15 3 2 2
  12. 16 1 3 2
  13. 17 2 3 2
  14. 19 1 1 3
  15. 20 2 1 3
  16. 22 1 2 3
  17. 23 2 2 3

答案4

得分: 0

@onyambu 在比赛中获胜。 (cgw 是我; ak** 是 akrun 的解决方案)

  1. bar5 = 1:5
  2. foo55 <- expand.grid(bar5,bar5,bar5,bar5,bar5)
  3. microbenchmark(ony(foo55), cgw(foo55), akply(foo55), akbase(foo55), andre(foo55))
  4. Unit: microseconds
  5. expr min lq mean median uq max neval cld
  6. ony(foo55) 455.117 495.2335 589.6801 517.3755 634.9795 3107.222 100 a
  7. cgw(foo55) 314076.038 317184.4050 348711.9522 319784.5870 324921.0335 2691161.873 100 b
  8. akply(foo55) 14156.653 14835.2230 16194.3699 15160.0270 16441.3550 74019.622 100 a
  9. akbase(foo55) 858.969 896.8310 1055.4277 970.6395 1117.2420 4098.860 100 a
  10. andre(foo55) 8161.406 8531.1700 9188.4801 8872.0325 9284.0995 14548.383 100 a
英文:

After verifying the answers so far (18:00 EST Weds 15 Feb), I ran a benchmark comparison. @onyambu wins the race. (cgw is me; ak** are akrun's solutions)

  1. bar5 = 1:5
  2. foo55 &lt;- expand.grid(bar5,bar5,bar5,bar5,bar5)
  3. microbenchmark(ony(foo55), cgw(foo55), akply(foo55), akbase(foo55), andre(foo55))
  4. Unit: microseconds
  5. expr min lq mean median uq max neval cld
  6. ony(foo55) 455.117 495.2335 589.6801 517.3755 634.9795 3107.222 100 a
  7. cgw(foo55) 314076.038 317184.4050 348711.9522 319784.5870 324921.0335 2691161.873 100 b
  8. akply(foo55) 14156.653 14835.2230 16194.3699 15160.0270 16441.3550 74019.622 100 a
  9. akbase(foo55) 858.969 896.8310 1055.4277 970.6395 1117.2420 4098.860 100 a
  10. andre(foo55) 8161.406 8531.1700 9188.4801 8872.0325 9284.0995 14548.383 100 a

huangapple
  • 本文由 发表于 2023年2月16日 02:03:37
  • 转载请务必保留本文链接:https://go.coder-hub.com/75463792.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定