Where to place is.na() when using grepl to find non-missing non-matches?

Question

I want to find the non-missing (!) entries of var2 that do not contain a given expression and retrieve the corresponding values of var1. By default, negating grepl also selects the missing entries (grepl returns FALSE for NA), which I want to avoid. I came up with two approaches, and one of them delivers wrong results. I would like to understand why. Both approaches are shown below with their output. Thank you!

    df <- data.frame(
      var1 = 1:150,
      var2 = c(rep(NA, 100), rep("ABC", 45), rep("CBA", 5))
    )
    exp <- "BC"

    ## Correct results with
    df$var1[!grepl(exp, df$var2, fixed = TRUE) & !is.na(df$var2)]
    # [1] 146 147 148 149 150

    ## Incorrect results with
    df$var1[!grepl(exp, df$var2[!is.na(df$var2)], fixed = TRUE)]
    # [1] 46 47 48 49 50 96 97 98 99 100 146 147 148 149 150

    ## The var2 values at those positions
    print(df[c(46, 47, 48, 49, 50, 96, 97, 98, 99, 100, 146, 147, 148, 149, 150), 2])
    # [1] NA NA NA NA NA NA NA NA NA NA "CBA" "CBA" "CBA" "CBA" "CBA"

Answer 1

Score: 4

Your issue is due to R's "recycling". As usual, it's much easier to see with a smaller example: let's use 15 rows instead of 150 and run each component separately to see what's going on:

    ## nice small sample (exp is the pattern from the question)
    df <- data.frame(
      var1 = 1:15,
      var2 = c(rep(NA, 10), rep("ABC", 4), rep("CBA", 1))
    )
    exp <- "BC"

    ## here are the non-missing var2 values, there are 5 of them
    df$var2[!is.na(df$var2)]
    # [1] "ABC" "ABC" "ABC" "ABC" "CBA"

    ## and when we grep them, we get 5 TRUE/FALSE values
    (!grepl(exp, df$var2[!is.na(df$var2)], fixed = TRUE))
    # [1] FALSE FALSE FALSE FALSE TRUE

    ## but how long is var1? 15, not 5
    df$var1
    # [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

    ## what happens when we index a length-15 vector by fewer than
    ## 15 TRUE/FALSE values? The smaller vector is "recycled", that
    ## is, repeated until it gets to the length of the larger vector.
    ## Here are a couple of examples of that with a length-3 index
    df$var1[c(TRUE, FALSE, FALSE)]
    # [1] 1 4 7 10 13
    df$var1[c(FALSE, TRUE, FALSE)]
    # [1] 2 5 8 11 14
    df$var1[c(TRUE, TRUE, FALSE)]
    # [1] 1 2 4 5 7 8 10 11 13 14

    ## Remember that your grepl result is length 5
    !grepl(exp, df$var2[!is.na(df$var2)], fixed = TRUE)
    # [1] FALSE FALSE FALSE FALSE TRUE

    ## So it will be recycled 3 times up to length 15,
    ## and hopefully now this result makes sense!
    df$var1[!grepl(exp, df$var2[!is.na(df$var2)], fixed = TRUE)]
    # [1] 5 10 15

I don't think there's a great way to get the right answer with this subset approach. Your first approach with & is correct and proper.
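If subsetting is still preferred, one possible workaround (not part of the answer above, only a sketch) is to filter the whole data frame first, so that var1 and var2 are shortened together and the logical index has the same length as the vector it indexes. The names `wrong_mask`, `non_missing`, `result_subset`, and `result_and` are made up for this illustration.

```r
df <- data.frame(
  var1 = 1:150,
  var2 = c(rep(NA, 100), rep("ABC", 45), rep("CBA", 5))
)
exp <- "BC"

# The mask from the failing approach has 50 elements, but var1 has 150,
# so R silently recycles the mask three times.
wrong_mask <- !grepl(exp, df$var2[!is.na(df$var2)], fixed = TRUE)
length(wrong_mask)   # 50
length(df$var1)      # 150

# Assumption for this sketch: drop the NA rows from the whole data frame,
# then search and index within the same 50-row subset.
non_missing <- df[!is.na(df$var2), ]
result_subset <- non_missing$var1[!grepl(exp, non_missing$var2, fixed = TRUE)]

# The approach with & from the question, for comparison.
result_and <- df$var1[!grepl(exp, df$var2, fixed = TRUE) & !is.na(df$var2)]

identical(result_subset, result_and)   # TRUE
```

Both versions return 146 to 150. The version with & stays shorter and avoids the intermediate copy, which is consistent with the recommendation above.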

Answer 2

Score: 1

`regexpr()` returns `NA` if it is supplied `NA`.

If you want the elements that **don't match** your expression, you can do:

``` r
df <- data.frame(
  var1 = 1:150,
  var2 = c(rep(NA, 100), rep("ABC", 45), rep("CBA", 5))
)
exp <- "BC"
df[regexpr(exp, df$var2, fixed = TRUE) %in% -1, ]
#>     var1 var2
#> 146  146  CBA
#> 147  147  CBA
#> 148  148  CBA
#> 149  149  CBA
#> 150  150  CBA
```

<sup>Created on 2023-06-30 with reprex v2.0.2</sup>

`regexpr()` returns `-1` when it cannot match the expression, `NA` when it is given `NA`, and otherwise the position of the first match. If you want the elements that do match,

    !(regexpr(exp, df$var2, fixed = TRUE) %in% c(NA, -1))

is `TRUE` for all matched, non-missing elements.
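To make the three possible return values concrete, here is a small sketch on a three-element vector. The vector `x` and the names `pos`, `non_match`, and `is_match` are invented for this illustration.

```r
# One missing value, one match, one non-match.
x <- c(NA, "ABC", "CBA")

pos <- regexpr("BC", x, fixed = TRUE)
as.integer(pos)   # NA 2 -1

# %in% never returns NA, so missing values drop out without an is.na() call.
non_match <- pos %in% -1               # FALSE FALSE TRUE
is_match  <- !(pos %in% c(NA, -1))     # FALSE TRUE FALSE

x[non_match]   # "CBA"
x[is_match]    # "ABC"
```

The design point is that `%in%` always yields `TRUE` or `FALSE`, whereas a comparison such as `pos == -1` would propagate `NA` into the index.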

huangapple
  • Published on 2023-06-29 21:57:33
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76581738.html