Non-rowwise method to change multiple values in columns if present in other columns

Question

In the example df below, for each row I want to determine if any values in b1:b3 are in a1:a3. If a match is not found, then change the unmatched b1:b3 value to NA. Note that the number of a and b columns in df may vary but the naming convention remains the same:

```r
df <- structure(list(row = c(1L, 12L, 17L, 44L, 55L, 90L), 
                     b1 = c(55L, 17L, 12L, 12L, 1L, 12L), 
                     b2 = c(NA, 44L, 44L, 17L, NA, 17L),
                     b3 = c(NA, 90L, 90L, 90L, NA, 44L), 
                     a1 = c(55L, 17L, 12L, 12L, 1L, 12L), 
                     a2 = c(NA, 44L, 44L, 17L, NA, 28L), 
                     a3 = c(NA, 90L, NA, 90L, NA, 44L)), 
                row.names = c(NA, -6L), 
                class = "data.frame")
```

Desired output:

```
  row b1 b2 b3 a1 a2 a3
1   1 55 NA NA 55 NA NA
2  12 17 44 90 17 44 90
3  17 12 44 NA 12 44 NA
4  44 12 17 90 12 17 90
5  55  1 NA NA  1 NA NA
6  90 12 NA 44 12 28 44
```

I can achieve the desired output with the following for loop:

```r
acols <- grep("^a", colnames(df))
bcols <- grep("^b", colnames(df))

for (i in 1:nrow(df)) {
  for (j in bcols) {
    if (df[i, j] %in% df[i, acols]) {
      next
    } else {
      df[i, j] <- NA
    }
  }
}
```

However, I'm looking for a more efficient method because the real data are much larger. Does the tidyverse or base R offer an alternative? I've searched for across() approaches but couldn't translate any of them to my needs.

Answer 1

Score: 3


Here's a base solution that should be quite fast:

```r
acols <- grep("^a", colnames(df))
bcols <- grep("^b", colnames(df))

df[bcols] = lapply(
  df[bcols], 
  \(bb) {
    bb[rowSums(bb == df[acols], na.rm = TRUE) == 0] = NA_integer_
    bb
  })

df
#   row b1 b2 b3 a1 a2 a3
# 1   1 55 NA NA 55 NA NA
# 2  12 17 44 90 17 44 90
# 3  17 12 44 NA 12 44 NA
# 4  44 12 17 90 12 17 90
# 5  55  1 NA NA  1 NA NA
# 6  90 12 NA 44 12 28 44
```
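
To see why the `rowSums()` mask works, it can help to run the steps for a single b column. The sketch below rebuilds the question's data and walks through `b3` only; the intermediate names `hits` and `n_hits` are just for illustration, and `as.matrix()` is used here only to make the column-wise recycling explicit (it should give the same comparison as `bb == df[acols]`).

```r
# Rebuild the example data from the question.
df <- data.frame(
  row = c(1L, 12L, 17L, 44L, 55L, 90L),
  b1 = c(55L, 17L, 12L, 12L, 1L, 12L),
  b2 = c(NA, 44L, 44L, 17L, NA, 17L),
  b3 = c(NA, 90L, 90L, 90L, NA, 44L),
  a1 = c(55L, 17L, 12L, 12L, 1L, 12L),
  a2 = c(NA, 44L, 44L, 17L, NA, 28L),
  a3 = c(NA, 90L, NA, 90L, NA, 44L)
)
acols <- grep("^a", colnames(df))

bb <- df$b3

# Compare b3 against every a column at once: a 6 x 3 logical matrix,
# with NA wherever either side of the comparison is NA.
hits <- bb == as.matrix(df[acols])

# Count the matches per row; na.rm = TRUE treats NA comparisons as "no match".
n_hits <- rowSums(hits, na.rm = TRUE)

# Rows with zero matches get NA in b3.
bb[n_hits == 0] <- NA_integer_
bb
# [1] NA 90 NA 90 NA 44
```

Because the comparison is done one b column at a time against all a columns, the only loop left is over the b columns, which is why it scales so much better than looping over rows.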

I scaled your data up to 500k rows, and on that dataset this approach runs about 50x faster using 1/50 of the memory. I suspect the differences will be bigger as the number of columns increases.

```r
# Both functions use the acols and bcols indices defined above.
single_loop = function(df) {
  df[bcols] = lapply(
    df[bcols], 
    \(bb) {
      bb[rowSums(bb == df[acols], na.rm = TRUE) == 0] = NA
      bb
    })
  df
}

double_loop = function(df) {
  for (i in 1:nrow(df)) {
    for (j in bcols) {
      if (df[i, j] %in% df[i, acols]) {
        next
      } else {
        df[i, j] <- NA
      }
    }
  }
  df
}

# The trailing comma makes this sample rows. The original post omitted it,
# which selects columns instead; the `<df [6 × …` entries in the output below
# suggest the posted timings came from that version.
df_big = df[sample(1:nrow(df), size = 5e5, replace = TRUE), ]

bench::mark(
  op = double_loop(df_big),
  gregor = single_loop(df_big)
)
```

```
# # A tibble: 2 × 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result   
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>   
# 1 op          624.8ms  624.8ms      1.60     381MB     9.60     1     6      625ms <df [6 ×…
# 2 gregor       12.8ms   13.4ms     48.5     7.63MB     9.33    26     5      536ms <df [6 ×…
# # … with 3 more variables: memory <list>, time <list>, gc <list>
# Warning message:
# Some expressions had a GC in every iteration; so filtering is disabled.
```
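
Since the question mentions that the number of a and b columns can vary, the same idea can be wrapped in a small helper that looks the columns up by prefix. This is only a sketch: `na_unmatched`, `b_prefix`, and `a_prefix` are hypothetical names, not part of the answer above, and it assumes the prefixes identify the two column groups unambiguously.

```r
# Hypothetical helper: set b values to NA when they match no a value in the same row.
na_unmatched <- function(df, b_prefix = "b", a_prefix = "a") {
  bcols <- grep(paste0("^", b_prefix), colnames(df))
  acols <- grep(paste0("^", a_prefix), colnames(df))
  a_mat <- as.matrix(df[acols])  # convert once, reuse for every b column
  df[bcols] <- lapply(df[bcols], function(bb) {
    bb[rowSums(bb == a_mat, na.rm = TRUE) == 0] <- NA
    bb
  })
  df
}

# The example data from the question.
df <- data.frame(
  row = c(1L, 12L, 17L, 44L, 55L, 90L),
  b1 = c(55L, 17L, 12L, 12L, 1L, 12L),
  b2 = c(NA, 44L, 44L, 17L, NA, 17L),
  b3 = c(NA, 90L, 90L, 90L, NA, 44L),
  a1 = c(55L, 17L, 12L, 12L, 1L, 12L),
  a2 = c(NA, 44L, 44L, 17L, NA, 28L),
  a3 = c(NA, 90L, NA, 90L, NA, 44L)
)

out <- na_unmatched(df)

# Works the same way when the number of b and a columns differs.
out_fewer_b <- na_unmatched(df[c("row", "b1", "b2", "a1", "a2", "a3")])
```

On the example data, `out` should reproduce the desired output from the question: `b2` loses its 17 in the last row and `b3` loses its 90 in the third row, while the a columns are left untouched.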

Answer 2

Score: 2


If you need to check for values in any position you can round-trip pivot the data:

```r
library(tidyr)
library(dplyr)  # .by and pick() need dplyr 1.1.0 or later

df %>%
  pivot_longer(-row) %>%
  # Within each row, flag b values that appear in none of the a values.
  # [1:n()] pads the b-only mask with NA so it has one entry per value;
  # this relies on the b columns coming before the a columns.
  mutate(grp = substr(name, 1L, 1L),
         value = replace(value, !(value[grp == "b"] %in% value[grp == "a"])[1:n()], NA), 
         grp = NULL,
         .by = row) %>%
  pivot_wider()
```

```
# A tibble: 6 × 7
    row    b1    b2    b3    a1    a2    a3
  <int> <int> <int> <int> <int> <int> <int>
1     1    55    NA    NA    55    NA    NA
2    12    17    44    90    17    44    90
3    17    12    44    NA    12    44    NA
4    44    12    17    90    12    17    90
5    55     1    NA    NA     1    NA    NA
6    90    12    NA    44    12    28    44
```

If it's ok to check by column pairs, you could do:

```r
df %>%
  mutate(replace(pick(b1:b3), pick(a1:a3) != pick(b1:b3) | is.na(pick(a1:a3)), NA))
```

```
  row b1 b2 b3 a1 a2 a3
1   1 55 NA NA 55 NA NA
2  12 17 44 90 17 44 90
3  17 12 44 NA 12 44 NA
4  44 12 17 90 12 17 90
5  55  1 NA NA  1 NA NA
6  90 12 NA 44 12 28 44
```
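
The two checks give the same result on the example data, but they are not equivalent in general. A small made-up case (the `toy` data frame below is hypothetical, not from the question) shows the difference using base R: when a row holds the same values in swapped positions, the any-position check keeps them and the pairwise check does not.

```r
# Hypothetical data: row 1 has the same values in b and a, but in swapped positions.
toy <- data.frame(b1 = c(44L, 12L), b2 = c(17L, 17L),
                  a1 = c(17L, 12L), a2 = c(44L, 28L))
a_mat <- as.matrix(toy[c("a1", "a2")])
b_mat <- as.matrix(toy[c("b1", "b2")])

# Any-position check: keep a b value if it equals any a value in the same row.
any_pos <- b_mat
for (j in seq_len(ncol(b_mat))) {
  no_match <- rowSums(b_mat[, j] == a_mat, na.rm = TRUE) == 0
  any_pos[no_match, j] <- NA
}

# Pairwise check: keep b1 only if it equals a1, and b2 only if it equals a2.
pairwise <- b_mat
pairwise[is.na(a_mat) | b_mat != a_mat] <- NA

any_pos
#      b1 b2
# [1,] 44 17
# [2,] 12 NA

pairwise
#      b1 b2
# [1,] NA NA
# [2,] 12 NA
```

If the real data can contain matches in different positions, as the question's loop allows, the any-position version is the one that reproduces the loop's behavior.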

Posted by huangapple on May 18, 2023 at 10:11:04
When reposting, please keep the link to this article: https://go.coder-hub.com/76277299.html