June 8, 2023, 17:33:51

Return all regex matches even when there is partial overlap in the matches

Question

I have a regex pattern that looks for multiple words in a text and returns each match together with (up to) five words that precede it and (up to) five words that follow it.

The problem is that if the regex matches several terms within this range of words, only the first match is returned.
For example, the following regex essentially looks for the words "book" and "page",
and the (?:\\w+\\W+){0,5} and (?:\\W+\\w+){0,5} parts before and after the keyword group pull in the surrounding words.

The following example only returns a single match:

test_str <- "Made out of wood, a book can contain many pages that are used to transmit information."
my_regex <- "(?i)\\b(?:\\w+\\W+){0,5}(\\bbook?\\w+|\\bpage?\\w+)\\b(?:\\W+\\w+){0,5}"
stringr::str_extract_all(test_str, pattern = my_regex)
[[1]]
[1] "Made out of wood, a book can contain many pages that"

While I would expect:

[[1]]
[1] "Made out of wood, a **book** can contain many pages that"
[2] "a book can contain many **pages** that are used to transmit"

(Matches highlighted)

I tried to solve this by using a positive lookahead assertion but I did not get it to work as I wanted.
What can I do to modify my regex?
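
For context, the single result follows from how regex matching works: matches do not overlap, so once the first match has consumed the text up to "that", the search resumes after it. The sketch below illustrates this with base R (gregexpr with perl = TRUE is used here instead of stringr, on the assumption that both engines treat this pattern the same way); the names match_end, remaining and has_keyword are made up for the illustration.

```r
test_str <- "Made out of wood, a book can contain many pages that are used to transmit information."
my_regex <- "(?i)\\b(?:\\w+\\W+){0,5}(\\bbook?\\w+|\\bpage?\\w+)\\b(?:\\W+\\w+){0,5}"

# Find all (non-overlapping) matches of the original pattern
m <- gregexpr(my_regex, test_str, perl = TRUE)[[1]]

# Position of the last character consumed by the first match
match_end <- m[1] + attr(m, "match.length")[1] - 1L

# Text that is still available to the engine after the first match
remaining <- substring(test_str, match_end + 1L)
print(remaining)

# "pages" was already consumed as trailing context, so no keyword is left
has_keyword <- grepl("(?i)\\b(?:book|page)", remaining, perl = TRUE)
print(has_keyword)
```

Because "pages" is swallowed as one of the five trailing words of the "book" match, it cannot start a match of its own.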

Answer 1

Score: 1

You could split the regex into one pattern per keyword instead of using the or operator "|", and run each pattern separately:

test_str <- "Made out of wood, a book can contain many pages that are used to transmit information."
lr <- list()
lr[1] <- "(?i)\\b(?:\\w+\\W+){0,5}(\\bbook?\\w+)\\b(?:\\W+\\w+){0,5}"
lr[2] <- "(?i)\\b(?:\\w+\\W+){0,5}(\\bpage?\\w+)\\b(?:\\W+\\w+){0,5}"
sapply(lr, function(x) stringr::str_extract_all(test_str, pattern = x))
[[1]]
[1] "Made out of wood, a book can contain many pages that"
[[2]]
[1] "a book can contain many pages that are used to transmit"
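
If the keyword list grows, the per-keyword patterns can be generated rather than typed out. The following is only a sketch of that idea, not part of the answer above: keywords and context_pattern are hypothetical names, base R regmatches/gregexpr stand in for stringr, and the keyword part is simplified to the stem followed by \\w* instead of the original book?\\w+.

```r
test_str <- "Made out of wood, a book can contain many pages that are used to transmit information."

# Hypothetical keyword stems; each one gets its own pattern
keywords <- c("book", "page")

# Build "up to 5 words + keyword + up to 5 words" for a single stem
context_pattern <- function(stem) {
  paste0("(?i)\\b(?:\\w+\\W+){0,5}\\b", stem, "\\w*\\b(?:\\W+\\w+){0,5}")
}

# Run every pattern on its own so matches for different keywords may overlap
matches_per_keyword <- lapply(keywords, function(stem) {
  regmatches(test_str, gregexpr(context_pattern(stem), test_str, perl = TRUE))[[1]]
})

res <- unlist(matches_per_keyword)
print(res)
```

One limitation to keep in mind: matches of the same pattern still cannot overlap, so two occurrences of the same keyword within five words of each other would again produce a single result.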

Answer 2

Score: 0

As shown in this answer, you can enclose what you want to capture in a positive lookahead pattern, and then replace the match.length attribute with capture.length so that the otherwise zero-length match actually covers what was captured.

A second problem arises once the capture sits inside a lookahead: you want to match "up to" 5 words before and after a keyword, but with a simple quantifier such as (?:\\w+\\W+){0,5}, every word within 5 words of the keyword satisfies the assertion and yields its own match. Since fewer than 5 preceding words should be matched only when they start at the beginning of the line, add ^(?:\\w+\\W+){0,4} as an alternative to (?:\\w+\\W+){5}. The same idea applies to the "up to" 5 words that follow the keyword, with (?:\\W+\\w+){0,4}$ as the alternative:

test_str <- "Made out of wood, a book can contain many pages that are used to transmit information."

my_regex <- "(?i)(?=(\\b(?:(?:\\w+\\W+){5}|^(?:\\w+\\W+){0,4})(?:\\bbooks?|\\bpages?)\\b(?:(?:\\W+\\w+){5}|(?:\\W+\\w+){0,4}$)))"
m <- gregexpr(my_regex, test_str, perl = TRUE)
m <- lapply(m, function(i) {
  attr(i, "match.length") <- attr(i, "capture.length")
  i
})
regmatches(test_str, m)

Demo: https://ideone.com/wZPdd2
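
To see why the anchored alternatives matter, the lookahead approach can be wrapped in a small helper and run with both variants of the pattern. This is a sketch under a few assumptions: extract_overlapping is a hypothetical helper name, it reads the capture.start and capture.length attributes directly instead of overwriting match.length, and the naive pattern is a reconstruction of the "simple quantifier" case described above rather than code from the answer.

```r
# Hypothetical helper: return the text of capture group 1 for every
# zero-length lookahead match, so results are allowed to overlap
extract_overlapping <- function(text, pattern) {
  m <- gregexpr(pattern, text, perl = TRUE)[[1]]
  if (m[1] == -1L) return(character(0))
  starts <- as.vector(attr(m, "capture.start")[, 1])
  lens <- as.vector(attr(m, "capture.length")[, 1])
  substring(text, starts, starts + lens - 1L)
}

test_str <- "Made out of wood, a book can contain many pages that are used to transmit information."

# Pattern from the answer: exactly 5 words, or fewer only at a line edge
anchored <- "(?i)(?=(\\b(?:(?:\\w+\\W+){5}|^(?:\\w+\\W+){0,4})(?:\\bbooks?|\\bpages?)\\b(?:(?:\\W+\\w+){5}|(?:\\W+\\w+){0,4}$)))"

# Assumed naive variant: plain {0,5} quantifiers on both sides
naive <- "(?i)(?=(\\b(?:\\w+\\W+){0,5}(?:\\bbooks?|\\bpages?)\\b(?:\\W+\\w+){0,5}))"

anchored_res <- extract_overlapping(test_str, anchored)
naive_res <- extract_overlapping(test_str, naive)

print(anchored_res)
# The naive variant matches at every word start close enough to a keyword
print(length(naive_res))
```

With the anchored pattern the helper should return the two expected windows, while the naive pattern returns one result per word that lies within five words of a keyword.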
