Regex to ignore values that have a decimal point in front of value?
Question
I have a dataset that looks like this:
> dput(test)
structure(list(Value = c("B20", "I82.B20", "B20, E88.1"), City = c("NY",
"LA", "PA")), class = "data.frame", row.names = c(NA, -3L))
I want to extract the rows whose Value contains 'B20', so I have the following code:
B20 <- test[grep("B20", test$Value), ]
However, I do NOT want to include the rows where 'B20' is preceded by a decimal point (such as row 2 (I82.B20)).
The desired output should look like this:
> dput(B20)
structure(list(Value = c("B20", "B20, E88.1"), City = c("NY", "PA")), row.names = c(NA, 3L), class = "data.frame")
Answer 1
Score: 1
The obvious solution is to select the rows containing "B20" but exclude every row containing ".B20":
test[grepl("B20", test$Value) & !grepl("\\.B20", test$Value),]
#> Value City
#> 1 B20 NY
#> 3 B20, E88.1 PA
Though if it has to be a single regex, you can match either the start of the string ^ or any character that isn't a period [^\.] by combining these possibilities as (^|[^\.]). Then match B20 and, for safety, add a word boundary \b. Note that we have to escape the backslashes in the R string, so the expression would be:
test[grep("(^|[^\\.])B20\\b", test$Value) ,]
#> Value City
#> 1 B20 NY
#> 3 B20, E88.1 PA
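To make the escaping note concrete, here is a small sketch. It assumes R 4.0.0 or later for the raw-string syntax, and the names pattern, pattern_raw and values are only illustrative, not part of the original code.

```r
# The pattern as an ordinary R string literal (backslashes doubled)
pattern <- "(^|[^\\.])B20\\b"

# cat() shows the characters the regex engine actually receives
cat(pattern, "\n")
#> (^|[^\.])B20\b

# Raw string (R >= 4.0.0): the same pattern without doubling the backslashes
pattern_raw <- r"((^|[^\.])B20\b)"
identical(pattern, pattern_raw)
#> [1] TRUE

# The Value column from the question
values <- c("B20", "I82.B20", "B20, E88.1")
grepl(pattern, values)
#> [1]  TRUE FALSE  TRUE
```

The doubled backslashes exist only in the R source; the regex itself contains single backslashes.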
When I come across a regex like this, I have to spend a little time working out what it does and thinking through the edge cases that might trip it up, so in real code I might prefer the first option even if it is a bit less "clever".
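As an illustration of such edge cases, the sketch below runs both approaches on a few extra values. The last two entries ("I82.B20, B20" and "B200") are hypothetical and do not appear in the original data; they are there only to show where the two options can disagree.

```r
# First three values come from the question; the last two are made-up edge cases
values <- c("B20", "I82.B20", "B20, E88.1", "I82.B20, B20", "B200")

# Option 1: contains "B20" and does not contain ".B20" anywhere
two_step <- grepl("B20", values) & !grepl("\\.B20", values)

# Option 2: single regex with start-or-non-period prefix and trailing word boundary
one_regex <- grepl("(^|[^\\.])B20\\b", values)

data.frame(values, two_step, one_regex)
#>         values two_step one_regex
#> 1          B20     TRUE      TRUE
#> 2      I82.B20    FALSE     FALSE
#> 3   B20, E88.1     TRUE      TRUE
#> 4 I82.B20, B20    FALSE      TRUE
#> 5         B200     TRUE     FALSE
```

The two-step filter drops a row as soon as ".B20" occurs anywhere in it, even when a standalone "B20" is also present, while the single regex keeps that row. Conversely, only the single regex rejects "B200", because of the word boundary. Which behaviour is right depends on what the data can contain.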