May 18, 2023, 10:11:04

Non-rowwise method to change multiple values in columns if present in other columns

Question

In the example df below, for each row I want to determine if any values in b1:b3 are in a1:a3. If a match is not found, then change the unmatched b1:b3 value to NA. Note that the number of a and b columns in df may vary but the naming convention remains the same:

```r
df <- structure(list(row = c(1L, 12L, 17L, 44L, 55L, 90L),
                     b1 = c(55L, 17L, 12L, 12L, 1L, 12L),
                     b2 = c(NA, 44L, 44L, 17L, NA, 17L),
                     b3 = c(NA, 90L, 90L, 90L, NA, 44L),
                     a1 = c(55L, 17L, 12L, 12L, 1L, 12L),
                     a2 = c(NA, 44L, 44L, 17L, NA, 28L),
                     a3 = c(NA, 90L, NA, 90L, NA, 44L)),
                row.names = c(NA, -6L),
                class = "data.frame")
```

Desired output:

```
  row b1 b2 b3 a1 a2 a3
1   1 55 NA NA 55 NA NA
2  12 17 44 90 17 44 90
3  17 12 44 NA 12 44 NA
4  44 12 17 90 12 17 90
5  55  1 NA NA  1 NA NA
6  90 12 NA 44 12 28 44
```

I can achieve the desired output with the following for loop:

```r
acols <- grep("^a", colnames(df))
bcols <- grep("^b", colnames(df))

for(i in 1:nrow(df)) {
  for(j in bcols) {
    if(df[i, j] %in% df[i, acols]) {
      next
    } else {
      df[i, j] <- NA
    }
  }
}
```

but I'm looking for a more efficient method as the real data are much larger. Does the tidyverse or base R have an alternative to this? I've searched for across() approaches but I couldn't translate any solutions to my needs.

Answer 1

Score: 3

Here's a base solution that should be quite fast:

```r
acols <- grep("^a", colnames(df))
bcols <- grep("^b", colnames(df))

df[bcols] = lapply(
  df[bcols],
  \(bb) {
    # blank out bb in rows where it equals none of the a columns
    bb[rowSums(bb == df[acols], na.rm = TRUE) == 0] = NA_integer_
    bb
  })

df
#   row b1 b2 b3 a1 a2 a3
# 1   1 55 NA NA 55 NA NA
# 2  12 17 44 90 17 44 90
# 3  17 12 44 NA 12 44 NA
# 4  44 12 17 90 12 17 90
# 5  55  1 NA NA  1 NA NA
# 6  90 12 NA 44 12 28 44
```
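To see why this works, it can help to look at the intermediate values for a single b column. The sketch below uses a cut-down three-row data frame; `df_small`, `hits` and `n_hits` are illustrative names, not part of the original answer. Comparing a vector with a data frame recycles the vector down each column and returns a logical matrix with one column per a column, and `rowSums()` then counts the matches in each row.

```r
# Illustrative three-row data (an assumption, loosely based on the example)
df_small <- data.frame(
  b2 = c(NA, 44L, 17L),
  a1 = c(55L, 17L, 12L),
  a2 = c(NA, 44L, 28L)
)
acols_small <- grep("^a", colnames(df_small))

bb <- df_small$b2

# one logical column per a column; NA wherever either side is NA
hits <- bb == df_small[acols_small]

# number of a columns that equal bb in each row (NA counts as no match)
n_hits <- rowSums(hits, na.rm = TRUE)

# rows without any match lose their b value
bb[n_hits == 0] <- NA_integer_
bb
```

Here only the second row has a match (44 appears in `a2`), so `bb` ends up as `NA 44 NA`.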

I scaled your data up to 500k rows, and on that dataset this approach runs about 50x faster using 1/50 of the memory. I suspect the differences will be bigger as the number of columns increases.

```r
single_loop = function(df){
  df[bcols] = lapply(
    df[bcols],
    \(bb) {
      bb[rowSums(bb == df[acols], na.rm = TRUE) == 0] = NA
      bb
    })
  df
}

double_loop = function(df){
  for (i in 1:nrow(df)) {
    for (j in bcols) {
      if (df[i, j] %in% df[i, acols]) {
        next
      } else {
        df[i, j] <- NA
      }
    }
  }
  df
}

# Sample rows with replacement. The trailing comma matters:
# df[i] without it selects columns, not rows.
df_big = df[sample(1:nrow(df), size = 5e5, replace = TRUE), ]

bench::mark(
  op = double_loop(df_big),
  gregor = single_loop(df_big)
)
# Output as originally posted. Its result column reads <df [6 ×…, i.e.
# six-row results, which suggests the run used the call without the
# comma, so treat the timings as indicative only.
# # A tibble: 2 × 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result   
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>   
# 1 op          624.8ms  624.8ms      1.60     381MB     9.60     1     6      625ms <df [6 ×…
# 2 gregor       12.8ms   13.4ms     48.5     7.63MB     9.33    26     5      536ms <df [6 ×…
# # … with 3 more variables: memory <list>, time <list>, gc <list>
# Warning message:
# Some expressions had a GC in every iteration; so filtering is disabled.
```
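Before trusting any timing, it is worth confirming that the scaled-up data really has the intended shape and that both approaches agree on it. The following is a minimal sketch under a few assumptions: the function names `vectorised` and `looped`, the seed, and the 1,000-row sample size are all made up for illustration and kept small so the loop finishes quickly.

```r
# Example data from the question
df <- data.frame(
  row = c(1L, 12L, 17L, 44L, 55L, 90L),
  b1 = c(55L, 17L, 12L, 12L, 1L, 12L),
  b2 = c(NA, 44L, 44L, 17L, NA, 17L),
  b3 = c(NA, 90L, 90L, 90L, NA, 44L),
  a1 = c(55L, 17L, 12L, 12L, 1L, 12L),
  a2 = c(NA, 44L, 44L, 17L, NA, 28L),
  a3 = c(NA, 90L, NA, 90L, NA, 44L)
)
acols <- grep("^a", colnames(df))
bcols <- grep("^b", colnames(df))

# column-wise version, as in the answer
vectorised <- function(d) {
  d[bcols] <- lapply(d[bcols], \(bb) {
    bb[rowSums(bb == d[acols], na.rm = TRUE) == 0] <- NA
    bb
  })
  d
}

# row-by-row version, as in the question
looped <- function(d) {
  for (i in seq_len(nrow(d))) {
    for (j in bcols) {
      if (!(d[i, j] %in% d[i, acols])) d[i, j] <- NA
    }
  }
  d
}

set.seed(1)  # arbitrary seed, only for reproducibility
df_check <- df[sample(nrow(df), size = 1000, replace = TRUE), ]

dim(df_check)                                       # rows x columns
all.equal(vectorised(df_check), looped(df_check))   # should be TRUE
```

If `dim()` reports 1000 rows and 7 columns and `all.equal()` returns `TRUE`, the same pattern can be scaled to a larger `size` for benchmarking.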

Answer 2

Score: 2

If you need to check for values in any position you can round-trip pivot the data:

```r
library(tidyr)
library(dplyr)  # .by in mutate() needs dplyr >= 1.1.0

df %>%
  pivot_longer(-row) %>%
  mutate(grp = substr(name, 1L, 1L),
         value = replace(value, !(value[grp == "b"] %in% value[grp == "a"])[1:n()], NA),
         grp = NULL,
         .by = row) %>%
  pivot_wider()

# # A tibble: 6 × 7
#     row    b1    b2    b3    a1    a2    a3
#   <int> <int> <int> <int> <int> <int> <int>
# 1     1    55    NA    NA    55    NA    NA
# 2    12    17    44    90    17    44    90
# 3    17    12    44    NA    12    44    NA
# 4    44    12    17    90    12    17    90
# 5    55     1    NA    NA     1    NA    NA
# 6    90    12    NA    44    12    28    44
```

If it's ok to check by column pairs, you could do:

```r
df %>%
  mutate(replace(pick(b1:b3), pick(a1:a3) != pick(b1:b3) | is.na(pick(a1:a3)), NA))

#   row b1 b2 b3 a1 a2 a3
# 1   1 55 NA NA 55 NA NA
# 2  12 17 44 90 17 44 90
# 3  17 12 44 NA 12 44 NA
# 4  44 12 17 90 12 17 90
# 5  55  1 NA NA  1 NA NA
# 6  90 12 NA 44 12 28 44
```
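Since the question mentions that the number of a and b columns can vary, the pairwise check could also be written without hard-coding `b1:b3`. The following is a base R sketch rather than the answerer's dplyr code. It assumes the a and b columns exist in equal numbers and pair up by position (a1 with b1, a2 with b2, and so on); the name `mismatch` is illustrative.

```r
# Example data from the question
df <- data.frame(
  row = c(1L, 12L, 17L, 44L, 55L, 90L),
  b1 = c(55L, 17L, 12L, 12L, 1L, 12L),
  b2 = c(NA, 44L, 44L, 17L, NA, 17L),
  b3 = c(NA, 90L, 90L, 90L, NA, 44L),
  a1 = c(55L, 17L, 12L, 12L, 1L, 12L),
  a2 = c(NA, 44L, 44L, 17L, NA, 28L),
  a3 = c(NA, 90L, NA, 90L, NA, 44L)
)

# pick the columns by name pattern instead of listing them
a <- as.matrix(df[grep("^a", colnames(df))])
b <- as.matrix(df[grep("^b", colnames(df))])

# assumption: a and b columns pair up one-to-one by position
stopifnot(identical(dim(a), dim(b)))

# TRUE where the pair differs or either side is missing
mismatch <- is.na(a) | is.na(b) | a != b
b[mismatch] <- NA

# write the cleaned b columns back by name
df[colnames(b)] <- as.data.frame(b)
df
```

On the example data this gives the same result as the any-position check, but the two are not equivalent in general: a b value that appears only in a different a column of the same row is kept by the any-position check and set to NA here.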
