August 4, 2023, 04:04:25

Why if I'm placing a lookbehind constraint on the capturing group, does it ensure compliance but also capture what is prior to the given constraint?

Question


Regex pattern in Python not capturing the correct substring, giving unexpected output

import re
# example
input_text = "It is close to that place, the NY hospital was the place where I was born, the truth is that's all I know, and it happened in November of the year 2000."
# regex pattern
# partner_match = re.search(r"(?:(?:[^.,;\n]+)(?<=\s)|^)\s*(.+?)\s*(?:was|would be|is)\s*the\s*(?:place|side)", input_text, flags = re.IGNORECASE) # here I tried using the negation operator [^...] but it doesn't work
partner_match = re.search(r"(?:(?:\.|,|;|\n)(?<=\s)|^)\s*(.+?)\s*(?:was|would be|is)\s*the\s*(?:place|side)", input_text, flags=re.IGNORECASE)
# print the captured string
if partner_match: print(partner_match.group(1))

Why, instead of giving me only this output:

the NY hospital

does it incorrectly give me this whole string?

It is close to that place, the NY hospital

What should I fix in my regex capture restrictions?

Answer 1

Score: 1


As explained in the comments, the regex engine first attempts a match at the beginning of the string. After failing to match (?:\.|,|;|\n)(?<=\s) at that location, it tries the other alternative, ^, which succeeds. The lazy (.+?) then expands from the start of the string until it is followed by "was the place", so the overall match succeeds at position 0. The engine therefore never tries (?:\.|,|;|\n)(?<=\s) at a later position, hence the undesired result.
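To make this concrete, here is a small sketch that runs the question's pattern on the question's text and checks where the match begins. The variable names are only illustrative.

```python
import re

input_text = (
    "It is close to that place, the NY hospital was the place where I was born, "
    "the truth is that's all I know, and it happened in November of the year 2000."
)

# The pattern from the question, unchanged.
pattern = r"(?:(?:\.|,|;|\n)(?<=\s)|^)\s*(.+?)\s*(?:was|would be|is)\s*the\s*(?:place|side)"

m = re.search(pattern, input_text, flags=re.IGNORECASE)

# The match starts at index 0, which shows that the ^ alternative was taken.
print(m.start())   # 0
print(m.group(1))  # It is close to that place, the NY hospital
```

Because re.search returns the leftmost successful match, the attempt anchored by ^ wins before any later comma is considered.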

As an aside, note that (?:\.|,|;|\n)(?<=\s) is a faulty construct, as it can only match a newline (\n), never a period, comma or semicolon. It reads, "match a period, comma, semicolon or newline, provided the position just after that character is preceded by a whitespace character", but the character preceding that position is the character just matched. The newline is the only one of the four that is a whitespace character.
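A quick way to check this claim is to run the construct on its own. The sample strings below are made up for illustration.

```python
import re

# The delimiter construct from the question, isolated.
faulty = re.compile(r"(?:\.|,|;|\n)(?<=\s)")

# Contains a comma, a period and a semicolon, but no newline.
no_match = faulty.search("one, two. three; four")

# Contains a newline, which is itself a whitespace character.
newline_match = faulty.search("one\ntwo")

print(no_match)                     # None
print(repr(newline_match.group()))  # '\n'
```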

Another problem is that by using \s* rather than \s+, the string "wastheside" (for example) is matched by the regular expression \s*(?:was|would be|is)\s*the\s*(?:place|side), which I assume is undesirable behaviour.
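The difference can be sketched as follows. The stricter variant with \s+ is written here only for comparison and is an assumption about what was intended, not a pattern taken from the question.

```python
import re

# Tail of the question's pattern: every run of whitespace is optional.
loose = re.compile(r"\s*(?:was|would be|is)\s*the\s*(?:place|side)")

# Comparison variant (assumed): at least one whitespace character is required.
strict = re.compile(r"\s+(?:was|would be|is)\s+the\s+(?:place|side)")

loose_hit = loose.search("wastheside")
strict_miss = strict.search("wastheside")
strict_hit = strict.search("it was the side")

print(loose_hit is not None)   # True
print(strict_miss)             # None
print(strict_hit is not None)  # True
```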

Note that (?:\.|,|;|\n) can be expressed more compactly as a character class: [.,;\n].
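As a small sanity check, both spellings can be compared on a made-up sample string; they find exactly the same characters.

```python
import re

sample = "a.b,c;d\ne"

# Alternation form used in the question.
with_alternation = re.findall(r"(?:\.|,|;|\n)", sample)

# Equivalent character class.
with_char_class = re.findall(r"[.,;\n]", sample)

print(with_alternation == with_char_class)  # True
print(with_char_class)                      # ['.', ',', ';', '\n']
```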

I assume the problem is to match a substring that begins after a space following a period, comma, semicolon or newline, and continues until it is followed by " was the place", " was the side", " would be the place", " would be the side", " is the place" or " is the side".

In the event that the string contains more than one space that is preceded by a period, comma, semicolon or newline, it must be decided which identifies the beginning of the matched string. I have assumed it is the last such pairing that meets all the requirements.

You therefore may attempt to match the following regular expression.

[.,;\n] (.+) +(?:was|would be|is) +the +(?:place|side)

where the desired result would be contained in capture group 1.
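Applied to the question's text in Python, this might look like the sketch below. One thing worth checking for your own data: re.search returns the leftmost match, so if an earlier delimiter followed by a space exists, the capture starts there. The second sample sentence and the variant that excludes delimiters from the capture with [^.,;\n]+ are assumptions added for illustration, not part of the pattern above.

```python
import re

input_text = (
    "It is close to that place, the NY hospital was the place where I was born, "
    "the truth is that's all I know, and it happened in November of the year 2000."
)

# Pattern with a capture group, as given above.
pattern = r"[.,;\n] (.+) +(?:was|would be|is) +the +(?:place|side)"
result = re.search(pattern, input_text).group(1)
print(result)  # the NY hospital

# Hypothetical text with an extra, earlier delimiter.
other_text = "Well, it is close, the NY hospital was the place."
leftmost = re.search(pattern, other_text).group(1)
print(leftmost)  # it is close, the NY hospital

# Assumed variant: the capture may not contain a delimiter, so it is
# limited to the segment immediately before the verb phrase.
segment_pattern = r"[.,;\n] ([^.,;\n]+) +(?:was|would be|is) +the +(?:place|side)"
last_segment = re.search(segment_pattern, other_text).group(1)
print(last_segment)  # the NY hospital
```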

One may instead use a positive lookbehind and a positive lookahead to simply match (but not capture) the desired string.

(?<=[.,;\n] ).+(?= +(?:was|would be|is) +the +(?:place|side))
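A minimal sketch of the lookaround version in Python follows. Python's re module requires lookbehinds to have a fixed width, which [.,;\n] followed by a single space satisfies (two characters). Here the whole match, group 0, is the desired text.

```python
import re

input_text = (
    "It is close to that place, the NY hospital was the place where I was born, "
    "the truth is that's all I know, and it happened in November of the year 2000."
)

# Lookbehind and lookahead assert the context without consuming it.
pattern = r"(?<=[.,;\n] ).+(?= +(?:was|would be|is) +the +(?:place|side))"

m = re.search(pattern, input_text)
print(m.group(0))  # the NY hospital
```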
