How to create new column with name of column that contains maximum value using dplyr in R?

Question

I have a data frame like this:

    dat <- data.frame(var1 = rnorm(10), var2 = rnorm(10), var3 = rnorm(10), var4 = rnorm(10))
    > dat
             var1        var2        var3       var4
    1  -1.3784414  1.06816022  1.46578217 -0.4141153
    2  -0.3272332 -0.69470574  0.02220395 -0.5502878
    3   0.2559891 -0.06964848 -0.34745180  0.6399705
    4   0.6029044  1.23680560 -0.72392358 -0.1990832
    5   1.3097174 -0.58028595 -0.01487186 -0.8765290
    6  -1.2356668  0.41330063 -1.00375989 -1.1974204
    7  -0.4126320  3.83320678 -1.42059022 -0.6747575
    8   1.7339653  0.58610348  0.40200428  1.4582103
    9   1.2994859  1.65355306  0.75985071  0.6455882
    10 -0.2353356  2.04468739 -0.11521602  0.3251901

The aim is to create a new column with the name of the column that contains the maximum value in each row within columns var2, var3 and var4.

Using the following command does not result in the correct output:

    library(dplyr)
    dat %>%
      rowwise() %>%
      mutate(var.max = colnames(.)[which.max(c_across(var2:var4))])
    # A tibble: 10 x 5
    # Rowwise:
          var1    var2    var3    var4 var.max
         <dbl>   <dbl>   <dbl>   <dbl> <chr>
     1   -1.38    1.07    1.47  -0.414 var2
     2  -0.327  -0.695  0.0222  -0.550 var2
     3   0.256 -0.0696  -0.347   0.640 var3
     4   0.603    1.24  -0.724  -0.199 var1
     5    1.31  -0.580 -0.0149  -0.877 var2
     6   -1.24   0.413   -1.00   -1.20 var1
     7  -0.413    3.83   -1.42  -0.675 var1
     8    1.73   0.586   0.402    1.46 var3
     9    1.30    1.65   0.760   0.646 var1
    10  -0.235    2.04  -0.115   0.325 var1

But if the column var1 is excluded from the data it works:

    dat %>%
      select(-var1) %>%
      rowwise() %>%
      mutate(var.max = colnames(.)[which.max(c_across(var2:var4))])
    # A tibble: 10 x 4
    # Rowwise:
          var2    var3    var4 var.max
         <dbl>   <dbl>   <dbl> <chr>
     1    1.07    1.47  -0.414 var3
     2  -0.695  0.0222  -0.550 var3
     3 -0.0696  -0.347   0.640 var4
     4    1.24  -0.724  -0.199 var2
     5  -0.580 -0.0149  -0.877 var3
     6   0.413   -1.00   -1.20 var2
     7    3.83   -1.42  -0.675 var2
     8   0.586   0.402    1.46 var4
     9    1.65   0.760   0.646 var2
    10    2.04  -0.115   0.325 var2

It also works when var1 is moved to the last position:

    dat %>%
      select(var2, var3, var4, var1) %>%
      rowwise() %>%
      mutate(var.max = colnames(.)[which.max(c_across(var2:var4))])
    # A tibble: 10 x 5
    # Rowwise:
          var2    var3    var4    var1 var.max
         <dbl>   <dbl>   <dbl>   <dbl> <chr>
     1    1.07    1.47  -0.414   -1.38 var3
     2  -0.695  0.0222  -0.550  -0.327 var3
     3 -0.0696  -0.347   0.640   0.256 var4
     4    1.24  -0.724  -0.199   0.603 var2
     5  -0.580 -0.0149  -0.877    1.31 var3
     6   0.413   -1.00   -1.20   -1.24 var2
     7    3.83   -1.42  -0.675  -0.413 var2
     8   0.586   0.402    1.46    1.73 var4
     9    1.65   0.760   0.646    1.30 var2
    10    2.04  -0.115   0.325  -0.235 var2

What am I missing here?

Answer 1

Score: 3

To keep your approach: since the selection var2:var4 skips only the first column, add 1 to the result of which.max() so that it indexes correctly into the full vector of column names, i.e.

    library(dplyr)
    dat %>%
      rowwise() %>%
      mutate(max_col = names(dat)[which.max(c_across(var2:var4)) + 1])
    # A tibble: 10 × 5
    # Rowwise:
           var1     var2     var3     var4 max_col
          <dbl>    <dbl>    <dbl>    <dbl> <chr>
     1    -1.09    0.768    0.251    -2.67 var2
     2   -0.822    -1.37    0.901     1.83 var4
     3   0.0280 -0.00555  -0.0709    0.729 var4
     4     1.45   -0.132    -2.47     1.45 var4
     5    0.506    -1.31    -2.75   -0.264 var4
     6 -0.00538     1.31   -0.368  0.00679 var2
     7   -0.166   -0.976    -1.42     1.50 var4
     8   -0.377   -0.101    0.135    0.784 var4
     9    0.535    0.438   0.0597    0.924 var4
    10    0.281   -0.481 -0.00177   -0.601 var3
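
Why the offset is needed can be sketched in base R. This is only an illustration using the rounded values of row 1 from the question; the names `vals`, `all_names`, `pos`, `wrong`, `right` and `shifted` are made up for the example. which.max() returns a position within the selected values, not within the whole data frame, so indexing the full vector of column names with it is off by the number of skipped leading columns.

```r
# Rounded values of row 1 in the question, restricted to var2:var4
vals <- c(var2 = 1.07, var3 = 1.47, var4 = -0.414)

# Same order as colnames(dat) in the question
all_names <- c("var1", "var2", "var3", "var4")

# Position of the maximum inside the selection (var3 is the 2nd selected value)
pos <- unname(which.max(vals))

wrong   <- all_names[pos]      # indexes the full name vector: picks "var2"
right   <- names(vals)[pos]    # indexes the selected names: picks "var3"
shifted <- all_names[pos + 1]  # compensates for the one skipped column
```

This reproduces the "var2" shown for row 1 in the question's first output, and it shows why dropping var1 or moving it to the end makes the original code work: the selection then starts at position 1 of colnames(.).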

If you would rather specify by name which columns to consider:

    my_cols <- c('var2', 'var3', 'var4')
    dat %>%
      rowwise() %>%
      # the offset assumes all columns outside my_cols come before them in dat
      mutate(max_col = names(dat)[which.max(c_across(names(dat)[names(dat) %in% my_cols])) + (ncol(dat) - length(my_cols))])
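
A possible variant, offered as a sketch rather than as part of the answer above: index into `my_cols` itself instead of `names(dat)`, so no offset is needed and the position of the other columns does not matter. It relies on `all_of()`, the tidyselect helper re-exported by dplyr; the seed and the name `res` are arbitrary choices for the example.

```r
library(dplyr)

set.seed(123)  # arbitrary seed, only to make the example reproducible
dat <- data.frame(var1 = rnorm(10), var2 = rnorm(10), var3 = rnorm(10), var4 = rnorm(10))

my_cols <- c("var2", "var3", "var4")

res <- dat %>%
  rowwise() %>%
  # c_across(all_of(my_cols)) returns the row's values in the order of my_cols,
  # so which.max() can index my_cols directly
  mutate(max_col = my_cols[which.max(c_across(all_of(my_cols)))]) %>%
  ungroup()
```

Because the selected names and the selected values come from the same vector, reordering the columns of `dat` should not change the result.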

Answer 2

Score: 1

To avoid adding the number of skipped columns (+1 in the case above), we can write a custom function max_col_name() using across() or pick():

    library(dplyr)
    max_col_name <- function(...) {
      # one-row data frame holding only the selected columns of the current row
      row_dat <- across(c(...)) # with dplyr >= 1.1.0, use `pick()` instead of `across()`
      # look the name up among the selected columns, so no offset is needed
      names(row_dat)[which.max(row_dat)]
    }
    dat %>%
      rowwise() %>%
      mutate(max_col = max_col_name(var2:var4))
    #> # A tibble: 10 x 5
    #> # Rowwise:
    #>       var1    var2    var3    var4 max_col
    #>      <dbl>   <dbl>   <dbl>   <dbl> <chr>
    #>  1  -0.560    1.22   -1.07   0.426 var2
    #>  2  -0.230   0.360  -0.218  -0.295 var2
    #>  3    1.56   0.401   -1.03   0.895 var4
    #>  4  0.0705   0.111  -0.729   0.878 var4
    #>  5   0.129  -0.556  -0.625   0.822 var4
    #>  6    1.72    1.79   -1.69   0.689 var2
    #>  7   0.461   0.498   0.838   0.554 var3
    #>  8   -1.27   -1.97   0.153 -0.0619 var3
    #>  9  -0.687   0.701   -1.14  -0.306 var2
    #> 10  -0.446  -0.473    1.25  -0.380 var3

Data from the OP:

    set.seed(123)
    dat <- data.frame(var1 = rnorm(10), var2 = rnorm(10), var3 = rnorm(10), var4 = rnorm(10))

Created on 2023-02-23 by the reprex package (v2.0.1)
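
For comparison, the same result can probably be reached without rowwise() at all, using base R's max.col(). This is a sketch of an alternative, not something proposed in the answers above; `cols` and `idx` are hypothetical names, and `ties.method = "first"` is chosen to match the behaviour of which.max(), which returns the first maximum.

```r
set.seed(123)
dat <- data.frame(var1 = rnorm(10), var2 = rnorm(10), var3 = rnorm(10), var4 = rnorm(10))

cols <- c("var2", "var3", "var4")

# max.col() gives, for each row of a matrix, the column position of the maximum
idx <- max.col(as.matrix(dat[cols]), ties.method = "first")

# Positions refer to the selected columns only, so index into cols
dat$max_col <- cols[idx]
```

Since this works on the whole matrix at once, it may scale better than a rowwise() pipeline on large data frames, at the cost of leaving the dplyr idiom.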

Posted by huangapple on 2023-02-23 20:13:16
Link: https://go.coder-hub.com/75544670.html