Regex to ignore values that have a decimal point in front of the value?

Question

I have a dataset that looks like this:

    > dput(test)
    structure(list(Value = c("B20", "I82.B20", "B20, E88.1"), City = c("NY", "LA", "PA")), class = "data.frame", row.names = c(NA, -3L))

I want to extract the rows whose 'Value' includes the code 'B20', so I have the following code:

    B20 <- test[grep(
      "B20",
      test$Value
    ),]

However, I do **not** want to include rows where 'B20' is preceded by a decimal point (such as row 2, *I82.B20*).

The output should look like this:

    > dput(B20)
    structure(list(Value = c("B20", "B20, E88.1"), City = c("NY", "PA")), row.names = c(1L, 3L), class = "data.frame")

Answer 1

Score: 1

The obvious solution is to select the rows that contain "B20" but exclude every row that contains ".B20":

    test[grepl("B20", test$Value) & !grepl("\\.B20", test$Value), ]
    #>        Value City
    #> 1        B20   NY
    #> 3 B20, E88.1   PA

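If it has to be a single regex, you can match either the start of the string ^ or any character that isn't a period [^\.], combining the two alternatives as (^|[^\.]). Then match B20 and, for safety, add a word boundary \b. Because the pattern is written as an R string, each backslash has to be doubled, so the expression becomes: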

    test[grep("(^|[^\\.])B20\\b", test$Value), ]
    #>        Value City
    #> 1        B20   NY
    #> 3 B20, E88.1   PA

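When I come across a regex like this, I have to spend a little time working out what it does and thinking through the edge cases that might trip it up, so in real code I would probably prefer the first option, even if it is a bit less "clever".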
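To make the edge-case concern concrete, here is a small sketch comparing the two approaches on two made-up values that are not in the question's data: one string containing both ".B20" and a standalone "B20", and one longer code that merely starts with "B20". The names edge, two_step and single are hypothetical.

```r
# Hypothetical edge cases (not taken from the question's data).
edge <- c("I82.B20, B20", "B201")

# Approach 1: contains "B20" but contains no ".B20" anywhere in the string.
two_step <- grepl("B20", edge) & !grepl("\\.B20", edge)

# Approach 2: single regex with a word boundary after B20.
single <- grepl("(^|[^\\.])B20\\b", edge)

# Show the two results side by side.
print(data.frame(edge, two_step, single))
```

The two approaches disagree on both values: the two-step filter drops the first string because ".B20" appears somewhere in it, even though a standalone "B20" is also present, and it keeps "B201" because nothing stops the match at the end of the code. Which behaviour is right depends on what the data is supposed to mean, so it is worth checking against real values.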

huangapple
  • Published on July 18, 2023, 04:42:33
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76707946.html