Return all regex matches even when there is partial overlap in the matches

Question


I have a regex pattern that looks for several words in a text and returns each match together with up to five words before it and up to five words after it.

The problem is that if the pattern matches more than one term within that window of words, only the first match is returned.
For example, the following regex looks for the words "book" and "page"; the (?:\\w+\\W+){0,5} part before the keyword group and the (?:\\W+\\w+){0,5} part after it pull in the surrounding words.

The following example only returns a single match:

test_str <- "Made out of wood, a book can contain many pages that are used to transmit information."

my_regex <- "(?i)\\b(?:\\w+\\W+){0,5}(\\bbook?\\w+|\\bpage?\\w+)\\b(?:\\W+\\w+){0,5}"

stringr::str_extract_all(test_str, pattern = my_regex)

[[1]]
[1] "Made out of wood, a book can contain many pages that"

While I would expect:

[[1]]
[1] "Made out of wood, a **book** can contain many pages that"
[2] "a book can contain many **pages** that are used to transmit"

(Matches highlighted)

I tried to solve this by using a positive lookahead assertion but I did not get it to work as I wanted.
What can I do to modify my regex?
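
A minimal sketch of why the second window goes missing, assuming base R's PCRE engine (gregexpr with perl = TRUE) behaves like stringr for this pattern: a regex search resumes after the end of the previous match, so the words already consumed by the first match, including "pages", are never scanned again. The names next_start and remaining are illustrative only.

```r
test_str <- "Made out of wood, a book can contain many pages that are used to transmit information."
my_regex <- "(?i)\\b(?:\\w+\\W+){0,5}(\\bbook?\\w+|\\bpage?\\w+)\\b(?:\\W+\\w+){0,5}"

# Base R stand-in for the stringr call (an assumption, used to avoid a package dependency).
m <- gregexpr(my_regex, test_str, perl = TRUE)[[1]]

# Position just past the end of the first match.
next_start <- m[1] + attr(m, "match.length")[1]

# Text the engine still has left to search after that match.
remaining <- substring(test_str, next_start)
print(remaining)
```

Because "pages" lies inside the span of the first match, nothing in the remaining text can start a second match.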

Answer 1

Score: 1


You could split the regex into one pattern per keyword instead of combining the keywords with the alternation operator "|":

test_str <- "Made out of wood, a book can contain many pages that are used to transmit information."

lr <- list()
lr[1] <- "(?i)\\b(?:\\w+\\W+){0,5}(\\bbook?\\w+)\\b(?:\\W+\\w+){0,5}"
lr[2] <- "(?i)\\b(?:\\w+\\W+){0,5}(\\bpage?\\w+)\\b(?:\\W+\\w+){0,5}"

sapply(lr, function(x) stringr::str_extract_all(test_str, pattern = x))

[[1]]
[1] "Made out of wood, a book can contain many pages that"

[[2]]
[1] "a book can contain many pages that are used to transmit"
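
The same idea can be sketched for an arbitrary keyword vector by generating one pattern per keyword. This is only an illustration: it uses base R (gregexpr and regmatches with perl = TRUE) rather than stringr, the names keywords, patterns and matches are hypothetical, and the keyword part is written as keyword followed by \\w* (an assumption that differs slightly from the book?\\w+ form above).

```r
test_str <- "Made out of wood, a book can contain many pages that are used to transmit information."

# Hypothetical keyword list; each keyword gets its own pattern.
keywords <- c("book", "page")

# Up to 5 words before, the keyword plus any word suffix, up to 5 words after.
patterns <- sprintf("(?i)\\b(?:\\w+\\W+){0,5}\\b%s\\w*\\b(?:\\W+\\w+){0,5}", keywords)

# Run every pattern separately so a match for one keyword cannot
# consume the text needed by another keyword.
matches <- lapply(patterns, function(p) {
  regmatches(test_str, gregexpr(p, test_str, perl = TRUE))[[1]]
})
names(matches) <- keywords
print(matches)
```

Separate patterns only help when the overlapping windows belong to different keywords; two occurrences of the same keyword within five words of each other would still yield a single match.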

Answer 2

Score: 0


As shown in this answer, you can enclose what you want to capture in a positive lookahead pattern, and then replace the match.length attribute with capture.length so that the otherwise zero-length match covers what was captured.

A secondary problem arises once the capture sits inside a lookahead. Because you want to match "up to" 5 words before and after a keyword, a simple quantifier like (?:\\w+\\W+){0,5} lets the assertion succeed at every word within 5 words of the keyword, so each of those positions produces its own match. Since you only want fewer than 5 preceding words when they start at the beginning of the text, require exactly 5 words with {5} and add ^(?:\\w+\\W+){0,4} as an alternative. The same idea applies to the "up to" 5 words that follow the keyword, with $ anchoring the shorter alternative to the end of the text:

test_str <- "Made out of wood, a book can contain many pages that are used to transmit information."

my_regex <- "(?i)(?=(\\b(?:(?:\\w+\\W+){5}|^(?:\\w+\\W+){0,4})(?:\\bbooks?|\\bpages?)\\b(?:(?:\\W+\\w+){5}|(?:\\W+\\w+){0,4}$)))"

m <- gregexpr(my_regex, test_str, perl=TRUE)
m <- lapply(m, function(i) {
  attr(i, "match.length") <- attr(i, "capture.length")
  i
})
regmatches(test_str, m)

Demo: https://ideone.com/wZPdd2
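
As a rough sketch, the lookahead approach can be wrapped in a small helper. The function name extract_context and its parameters (text, keywords, n) are hypothetical and not part of the answer; the sketch assumes the keywords are plain words without regex metacharacters, keeps the optional plural "s" from the pattern above, and reads the captured text directly from the capture.start and capture.length attributes instead of overwriting match.length.

```r
# Hypothetical helper: return every keyword with up to n words of context
# on each side, including windows that overlap.
extract_context <- function(text, keywords, n = 5L) {
  stopifnot(length(text) == 1L, n >= 1L)
  # Keyword alternation, e.g. "\\bbooks?|\\bpages?" (keywords assumed to be plain words).
  kw <- paste0("\\b", keywords, "s?", collapse = "|")
  # Exactly n words, or fewer only when anchored at the start or end of the text.
  pattern <- sprintf(
    "(?i)(?=(\\b(?:(?:\\w+\\W+){%d}|^(?:\\w+\\W+){0,%d})(?:%s)\\b(?:(?:\\W+\\w+){%d}|(?:\\W+\\w+){0,%d}$)))",
    n, n - 1L, kw, n, n - 1L
  )
  m <- gregexpr(pattern, text, perl = TRUE)[[1]]
  if (m[1] == -1L) return(character())
  # The lookahead itself has length zero; the capture group holds the real span.
  starts <- as.vector(attr(m, "capture.start"))
  lens <- as.vector(attr(m, "capture.length"))
  substring(text, starts, starts + lens - 1L)
}

test_str <- "Made out of wood, a book can contain many pages that are used to transmit information."
result <- extract_context(test_str, c("book", "page"))
print(result)
```

One caveat, as far as I can tell: the $ alternative only applies when the text ends on a word character, so a keyword followed by fewer than n words and then a final period would not be returned by this pattern.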

huangapple
  • Posted on June 8, 2023 at 17:33:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76430468.html