2023-06-29 21:57:33

Where to place is.na() when using grepl to find non-missing non-matches?

Question

I want to find the non-missing (!) entries of var2 that do not contain a given expression and retrieve the corresponding values of var1. Because grepl() returns FALSE for missing entries, negating it also selects the NA rows, which I want to avoid. I came up with two approaches, and one of them delivers wrong results. Why does it deliver wrong results? The code, with its output, is below. Thank you!

df <- data.frame(
  var1 = 1:150,
  var2 = c(rep(NA, 100), rep("ABC", 45), rep("CBA", 5))
)
exp <- "BC"
## Correct results with
df$var1[!grepl(exp, df$var2, fixed = TRUE) & !is.na(df$var2)]
# [1] 146 147 148 149 150
## Incorrect results with
df$var1[!grepl(exp, df$var2[!is.na(df$var2)], fixed = TRUE)]
# [1]  46  47  48  49  50  96  97  98  99 100 146 147 148 149 150
## var2 for the rows returned by the incorrect approach
print(df[c(46, 47, 48, 49, 50, 96, 97, 98, 99, 100, 146, 147, 148, 149, 150), "var2"])
# [1] NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    "CBA" "CBA" "CBA" "CBA" "CBA"
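
For readers who want to see why the missing entries get selected in the first place, here is a minimal sketch on a made-up three-element vector. The names `x` and `matched` are illustrative and not part of the original code.

```r
# Hypothetical vector: one missing value, one match, one non-match
x <- c(NA, "ABC", "CBA")

# grepl() reports FALSE for the missing element rather than NA
matched <- grepl("BC", x, fixed = TRUE)
matched
# [1] FALSE  TRUE FALSE

# Negating therefore flags the NA as a "non-match" too
!matched
# [1]  TRUE FALSE  TRUE

# Combining with is.na() on the same full-length vector drops it again
!matched & !is.na(x)
# [1] FALSE FALSE  TRUE
```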

Answer 1

Score: 4

Your issue is due to R's "recycling" of vectors. As usual, it's much easier to see with a smaller example, so let's use 15 rows instead of 150 and run each component separately to see what's going on:

## nice small sample
df <- data.frame(
  var1 = 1:15,
  var2 = c(rep(NA, 10), rep("ABC", 4), rep("CBA", 1))
)
## same pattern as in the question
exp <- "BC"
## here are the non-missing var2 values, there are 5 of them
df$var2[!is.na(df$var2)]
# [1] "ABC" "ABC" "ABC" "ABC" "CBA"
## and when we grepl them, we get 5 TRUE/FALSE values
(!grepl(exp, df$var2[!is.na(df$var2)], fixed = TRUE))
# [1] FALSE FALSE FALSE FALSE  TRUE
## but how long is var1? 15, not 5
df$var1
# [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
## what happens when we index a length-15 vector by fewer than
## 15 TRUE/FALSE values? The shorter vector is "recycled", that
## is, repeated until it reaches the length of the longer vector.
## Here are a couple of examples of that with a length-3 index
df$var1[c(TRUE, FALSE, FALSE)]
# [1]  1  4  7 10 13
df$var1[c(FALSE, TRUE, FALSE)]
# [1]  2  5  8 11 14
df$var1[c(TRUE, TRUE, FALSE)]
# [1]  1  2  4  5  7  8 10 11 13 14
## Remember that your grepl result is length 5
!grepl(exp, df$var2[!is.na(df$var2)], fixed = TRUE)
# [1] FALSE FALSE FALSE FALSE  TRUE
## So it will be recycled 3 times up to length 15,
## and hopefully now this result makes sense!
df$var1[!grepl(exp, df$var2[!is.na(df$var2)], fixed = TRUE)]
# [1]  5 10 15

I don't think there's a great way to get the right answer with this subset approach; your first approach with & is correct and proper.
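
To make the recycling explicit, here is a small sketch on the same 15-row sample. It first expands the length-5 index by hand with `rep_len()`, then shows one way the subset idea could be made consistent by applying the same mask to both columns. The names `idx`, `recycled`, `keep`, and `result` are made up for illustration. The second half is only a sketch under that assumption, and the single `&` mask stays the simpler option.

```r
df <- data.frame(
  var1 = 1:15,
  var2 = c(rep(NA, 10), rep("ABC", 4), rep("CBA", 1))
)
exp <- "BC"

# The length-5 logical index from the incorrect approach
idx <- !grepl(exp, df$var2[!is.na(df$var2)], fixed = TRUE)

# Spell out what R does implicitly: repeat idx up to length 15
recycled <- rep_len(idx, length(df$var1))
which(recycled)
# [1]  5 10 15

# Subsetting var1 with the same mask keeps both sides at length 5,
# so nothing is recycled
keep <- !is.na(df$var2)
result <- df$var1[keep][!grepl(exp, df$var2[keep], fixed = TRUE)]
result
# [1] 15
```

On this sample, `result` agrees with `df$var1[!grepl(exp, df$var2, fixed = TRUE) & !is.na(df$var2)]`.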

Answer 2

Score: 1

`regexpr()` returns `NA` for every `NA` it is given.
If you want the elements that **don't match** your expression, you can do:
``` r
df <- data.frame(
  var1 = 1:150,
  var2 = c(rep(NA, 100), rep("ABC", 45), rep("CBA", 5))
)
exp <- "BC"
df[regexpr(exp, df$var2, fixed = TRUE) %in% -1, ]
#>     var1 var2
#> 146  146  CBA
#> 147  147  CBA
#> 148  148  CBA
#> 149  149  CBA
#> 150  150  CBA
```

<sup>Created on 2023-06-30 with reprex v2.0.2</sup>

`regexpr()` returns -1 when it can't match the expression, `NA` when given `NA`, and otherwise the position at which it finds the match.
If you wanted the elements that do match,

`!(regexpr(exp, df$var2, fixed = TRUE) %in% c(NA, -1))`

would give all matched and non-missing elements.
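
A minimal sketch of the three kinds of return value, using a made-up three-element vector. The names `x` and `pos` are illustrative, and `as.integer()` is used here only to drop the attributes that `regexpr()` attaches to its result.

```r
# Hypothetical vector: one missing value, one match, one non-match
x <- c(NA, "ABC", "CBA")

# Start position of the match, -1 for no match, NA for missing input
pos <- as.integer(regexpr("BC", x, fixed = TRUE))
pos
# [1] NA  2 -1

# %in% never returns NA, so the missing element simply drops out
pos %in% -1
# [1] FALSE FALSE  TRUE

# Matched and non-missing elements
!(pos %in% c(NA, -1))
# [1] FALSE  TRUE FALSE
```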
