Why, when I place a lookbehind constraint before the capturing group, does the match satisfy the constraint but still capture the text that precedes it?

Question

Regex pattern in Python not capturing the correct substring, giving unexpected output

import re

#example
input_text = "It is close to that place, the NY hospital was the place where I was born, the truth is that's all I know, and it happened in November of the year 2000."

#regex pattern
# partner_match = re.search(r"(?:(?:[^.,;\n]+)(?<=\s)|^)\s*(.+?)\s*(?:was|would be|is)\s*the\s*(?:place|side)", input_text, flags = re.IGNORECASE) # here I tried using the negation operator [^...] but it doesn't work
partner_match = re.search(r"(?:(?:\.|,|;|\n)(?<=\s)|^)\s*(.+?)\s*(?:was|would be|is)\s*the\s*(?:place|side)", input_text, flags=re.IGNORECASE)

#here print captured string
if partner_match: print(partner_match.group(1))

Why, instead of giving me only this output:

the NY hospital

does it incorrectly give me this whole string:

It is close to that place, the NY hospital

What should I fix in my regex capture restrictions?

Answer 1

Score: 1

As explained in the comments, the regex engine will initially attempt a match at the beginning of the string. After failing to match (?:\.|,|;|\n)(?<=\s) at that location it attempts to match ^ and succeeds; the lazy (.+?) then expands until the rest of the pattern matches. Since a match is found starting at position 0, the engine never retries (?:\.|,|;|\n)(?<=\s) at a later position, hence the undesired result.
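
The behaviour described above can be checked directly. The following is a minimal sketch that reuses the question's pattern and text unchanged; only the variable names `pattern` and `m` are introduced here for illustration.

```python
import re

input_text = ("It is close to that place, the NY hospital was the place "
              "where I was born, the truth is that's all I know, and it "
              "happened in November of the year 2000.")

# The pattern from the question, unchanged.
pattern = r"(?:(?:\.|,|;|\n)(?<=\s)|^)\s*(.+?)\s*(?:was|would be|is)\s*the\s*(?:place|side)"

m = re.search(pattern, input_text, flags=re.IGNORECASE)

# The match starts at index 0, which shows that the ^ alternative was used.
print(m.start())   # 0
print(m.group(1))  # It is close to that place, the NY hospital
```

Because `re.search` reports the leftmost match, a success at index 0 means no later starting position is ever considered.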


As an aside, note that (?:\.|,|;|\n)(?<=\s) is a faulty construct as it can only match a newline (\n), never a period, comma or semicolon. It reads, "match a period, comma, semicolon or newline, provided the position just after it is preceded by a whitespace character", but the character preceding that position is the character just matched. Here the newline is the only one of the four characters that is a whitespace character.
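
A small check of this claim, as a sketch: the sample string below is made up for illustration and contains all four delimiters.

```python
import re

# The construct from the question: a delimiter followed by a lookbehind for whitespace.
faulty = r"(?:\.|,|;|\n)(?<=\s)"

# Hypothetical sample containing a period, a comma, a semicolon and a newline.
sample = "a. b, c; d\ne"

found = re.findall(faulty, sample)
print(found)  # ['\n']
```

Only the newline survives, because the lookbehind inspects the very character that was just consumed.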

Another problem is that by using \s* rather than \s+ the string "wastheside" (for example) is matched by the regular expression \s*(?:was|would be|is)\s*the\s*(?:place|side), which I assume is undesirable behaviour.
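
To illustrate the difference, here is a short sketch comparing the two quantifiers. The names `loose` and `strict` and the test strings are chosen here for illustration.

```python
import re

loose = r"\s*(?:was|would be|is)\s*the\s*(?:place|side)"   # whitespace optional
strict = r"\s+(?:was|would be|is)\s+the\s+(?:place|side)"  # whitespace required

loose_hit = bool(re.search(loose, "wastheside"))
strict_hit = bool(re.search(strict, "wastheside"))
strict_normal = bool(re.search(strict, "it was the side"))

print(loose_hit)      # True: the words may run together
print(strict_hit)     # False: at least one whitespace character is required
print(strict_normal)  # True: ordinary spacing still matches
```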

Note that (?:\.|,|;|\n) can be expressed more compactly as a character class: [.,;\n].
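
As a quick sanity check of that equivalence, the sketch below runs both forms over a made-up sample string and compares the results.

```python
import re

# Hypothetical sample containing each of the four delimiters once.
sample = "a. b, c; d\ne"

alternation = r"(?:\.|,|;|\n)"
char_class = r"[.,;\n]"

same = re.findall(alternation, sample) == re.findall(char_class, sample)
print(same)  # True
```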


I assume the problem is to match a substring that begins after a space following a period, comma, semicolon or newline, and continues until it is followed by " was the place", " was the side", " would be the place", " would be the side", " is the place" or " is the side".

In the event that the string contains more than one space that is preceded by a period, comma, semicolon or newline, it must be decided which identifies the beginning of the matched string. I have assumed it is the last such pairing that meets all the requirements.


You may therefore try matching the following regular expression:

[.,;\n] (.+) +(?:was|would be|is) +the +(?:place|side)

where the desired result would be contained in capture group 1.
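
Applied to the question's text with Python's `re` module, the expression could be used as sketched below. The second half uses a made-up sentence and a variant pattern that is not part of the answer; both are assumptions added for illustration.

```python
import re

input_text = ("It is close to that place, the NY hospital was the place "
              "where I was born, the truth is that's all I know, and it "
              "happened in November of the year 2000.")

# Pattern from the answer: delimiter, one space, then the captured text.
pattern = r"[.,;\n] (.+) +(?:was|would be|is) +the +(?:place|side)"

m = re.search(pattern, input_text, flags=re.IGNORECASE)
captured = m.group(1) if m else None
print(captured)  # the NY hospital

# Hypothetical sentence with two delimiters before the phrase.
two_delims = "a, b, c was the place"
leftmost = re.search(pattern, two_delims).group(1)
print(leftmost)  # b, c

# Assumed variant (not from the answer): forbid delimiters inside the group
# so that only the clause directly before the phrase is captured.
variant = r"[.,;\n] ([^.,;\n]+?) +(?:was|would be|is) +the +(?:place|side)"
last_clause = re.search(variant, two_delims).group(1)
print(last_clause)  # c
```

It is worth checking this against your own data: when several delimiters precede the phrase, `re.search` starts at the leftmost delimiter from which a match succeeds, so group 1 can span more than one clause. The variant shows one possible way to restrict the capture to the final clause.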


One may instead use a positive lookbehind and a positive lookahead to simply match (but not capture) the desired string.

(?<=[.,;\n] ).+(?= +(?:was|would be|is) +the +(?:place|side))
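
A minimal sketch of the lookaround version in Python follows. Here the whole match, `group(0)`, is the desired text, so no capture group is needed. Python's `re` accepts this lookbehind because it has a fixed width of two characters.

```python
import re

input_text = ("It is close to that place, the NY hospital was the place "
              "where I was born, the truth is that's all I know, and it "
              "happened in November of the year 2000.")

# Lookbehind asserts the delimiter and space; lookahead asserts the trailing phrase.
pattern = r"(?<=[.,;\n] ).+(?= +(?:was|would be|is) +the +(?:place|side))"

m = re.search(pattern, input_text, flags=re.IGNORECASE)
matched = m.group(0) if m else None
print(matched)  # the NY hospital
```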

huangapple
  • Posted on August 4, 2023 at 04:04:25
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76831309.html