Why, if I place a lookbehind constraint before the capturing group, does the pattern honour the constraint but still capture the text that precedes it?
Question
Regex pattern in Python not capturing the correct substring, giving unexpected output
import re

# Example text
input_text = "It is close to that place, the NY hospital was the place where I was born, the truth is that's all I know, and it happened in November of the year 2000."

# Regex pattern
# partner_match = re.search(r"(?:(?:[^.,;\n]+)(?<=\s)|^)\s*(.+?)\s*(?:was|would be|is)\s*the\s*(?:place|side)", input_text, flags=re.IGNORECASE)  # here I tried using the negation operator [^...] but it doesn't work
partner_match = re.search(r"(?:(?:\.|,|;|\n)(?<=\s)|^)\s*(.+?)\s*(?:was|would be|is)\s*the\s*(?:place|side)", input_text, flags=re.IGNORECASE)

# Print the captured string
if partner_match:
    print(partner_match.group(1))
Why, instead of giving me only this output:
the NY hospital
does it incorrectly give me this whole string:
It is close to that place, the NY hospital
What should I fix in my regex capture restrictions?
To fix the capture restriction, you can use the following pattern, which also keeps the captured group from crossing a delimiter:
partner_match = re.search(r"(?:^|(?<=[.,;\n]))\s*([^.,;\n]+?)\s*(?:was|would be|is)\s*the\s*(?:place|side)", input_text, flags=re.IGNORECASE)
With this pattern, group 1 is the desired substring "the NY hospital" rather than everything from the start of the string. Because the group can no longer contain a period, comma, semicolon or newline, the attempt anchored at ^ fails and the match starts after the comma.
Answer 1
Score: 1
As explained in the comments, the regex engine will initially attempt a match at the beginning of the string. After failing to match (?:\.|,|;|\n)(?<=\s) at that location it attempts to match ^ and succeeds. It therefore will make no further attempts to match (?:\.|,|;|\n)(?<=\s), hence the undesired result.
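To see this for yourself, here is a small sketch that runs the pattern from the question and inspects where the match begins. The variable names match_start and captured are just illustrative.

```python
import re

input_text = ("It is close to that place, the NY hospital was the place where "
              "I was born, the truth is that's all I know, and it happened in "
              "November of the year 2000.")

# Pattern from the question, unchanged.
pattern = r"(?:(?:\.|,|;|\n)(?<=\s)|^)\s*(.+?)\s*(?:was|would be|is)\s*the\s*(?:place|side)"

m = re.search(pattern, input_text, flags=re.IGNORECASE)

# The ^ alternative succeeds at index 0, so the match is anchored there
# and the lazy group has to stretch all the way to " was the place".
match_start = m.start()
captured = m.group(1)
print(match_start, repr(captured))
```

The match starts at index 0, which is why group 1 contains everything from the start of the string up to the keyword phrase.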
As an aside, note that (?:\.|,|;|\n)(?<=\s) is a faulty construct, as it can only match a newline (\n), never a period, comma or semicolon. It reads, "match a period, comma, semicolon or newline, provided the character just matched is a whitespace character": the lookbehind (?<=\s) is evaluated after the character X has been consumed, so the character it inspects is X itself. The newline is the only one of the four characters that is a whitespace character.
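A quick way to check this is to run the construct on a short made-up string containing all four delimiters. The second pattern is only my guess at what may have been intended (a delimiter followed by whitespace), written with a lookahead instead of a lookbehind.

```python
import re

# Made-up sample containing a period, a comma, a semicolon and a newline.
sample = "a. b, c; d\ne"

# Construct from the question: the lookbehind inspects the delimiter itself.
faulty = re.compile(r"(?:\.|,|;|\n)(?<=\s)")
faulty_hits = faulty.findall(sample)

# Assumed intent: a delimiter that is followed by whitespace.
followed_by_space = re.compile(r"[.,;\n](?=\s)")
lookahead_hits = followed_by_space.findall(sample)

print(faulty_hits)     # only the newline survives the lookbehind
print(lookahead_hits)  # the three punctuation marks that precede a space
```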
Another problem is that, by using \s* rather than \s+, the regular expression \s*(?:was|would be|is)\s*the\s*(?:place|side) matches the string "wastheside" (for example), which I assume is undesirable behaviour.
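The difference is easy to demonstrate. In this sketch, loose is the keyword part of the question's pattern and strict is the same expression with every \s* replaced by \s+; both names are mine.

```python
import re

loose = re.compile(r"\s*(?:was|would be|is)\s*the\s*(?:place|side)")
strict = re.compile(r"\s+(?:was|would be|is)\s+the\s+(?:place|side)")

# With \s*, the words may run together with no whitespace at all.
loose_hit = loose.search("wastheside") is not None

# With \s+, at least one whitespace character is required between words.
strict_hit_glued = strict.search("wastheside") is not None
strict_hit_spaced = strict.search("it was the side") is not None

print(loose_hit, strict_hit_glued, strict_hit_spaced)
```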
Note that (?:\.|,|;|\n) can be expressed more compactly as a character class: [.,;\n].
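As a small sanity check, both spellings find exactly the same characters in a made-up sample string:

```python
import re

# Made-up sample containing each of the four delimiters once.
sample = "one. two, three; four\nfive"

alternation_hits = re.findall(r"(?:\.|,|;|\n)", sample)
char_class_hits = re.findall(r"[.,;\n]", sample)

# Inside a character class the period is literal, so no escaping is needed.
print(alternation_hits == char_class_hits, char_class_hits)
```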
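I assume the problem is to match a substring that begins after a space following a period, comma, semicolon or newline, and continues until it is followed by " was the place", " was the side", " would be the place", " would be the side", " is the place" or " is the side".
In the event that the string contains more than one space that is preceded by a period, comma, semicolon or newline, it must be decided which one identifies the beginning of the matched string. I have assumed it is the last such pairing that meets all the requirements. Note, however, that because the dot can itself match the delimiters and the engine reports the leftmost match, the expressions below actually start at the first such pairing from which a match is possible; writing [^.,;\n]+ in place of .+ makes them start at the last one. For the example string the two coincide.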
You therefore may attempt to match the following regular expression.
[.,;\n] (.+) +(?:was|would be|is) +the +(?:place|side)
where the desired result would be contained in capture group 1.
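Here is a minimal sketch of that expression applied to the string from the question. The second sentence (sample) and the narrow variant are hypothetical additions of mine; they illustrate which delimiter the match starts from when several precede the keyword phrase.

```python
import re

input_text = ("It is close to that place, the NY hospital was the place where "
              "I was born, the truth is that's all I know, and it happened in "
              "November of the year 2000.")

# Expression from the answer; the wanted text is in capture group 1.
pattern = re.compile(r"[.,;\n] (.+) +(?:was|would be|is) +the +(?:place|side)")
m = pattern.search(input_text)
result = m.group(1) if m else None
print(result)

# Hypothetical sentence with two delimiters before the keyword phrase.
sample = "One, two, three was the place"

# With .+ the match starts at the first delimiter, since . also matches commas.
first_delim = pattern.search(sample).group(1)

# Hypothetical variant: a group that cannot cross a delimiter starts at the last one.
narrow = re.compile(r"[.,;\n] ([^.,;\n]+) +(?:was|would be|is) +the +(?:place|side)")
last_delim = narrow.search(sample).group(1)

print(repr(first_delim), repr(last_delim))
```

For the question's string both forms give the same result, because only one delimiter precedes " was the place".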
One may instead use a positive lookbehind and a positive lookahead to simply match (but not capture) the desired string.
(?<=[.,;\n] ).+(?= +(?:was|would be|is) +the +(?:place|side))
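A short sketch of the lookaround form, again on the string from the question. Here the whole match (group 0) is the wanted text, so no capture group is needed. Python's re module accepts this lookbehind because it has a fixed width of two characters.

```python
import re

input_text = ("It is close to that place, the NY hospital was the place where "
              "I was born, the truth is that's all I know, and it happened in "
              "November of the year 2000.")

# Lookbehind: a delimiter plus a space must precede the match.
# Lookahead: the keyword phrase must follow it. Neither is consumed.
lookaround = re.compile(
    r"(?<=[.,;\n] ).+(?= +(?:was|would be|is) +the +(?:place|side))"
)

m = lookaround.search(input_text)
whole_match = m.group(0) if m else None
print(whole_match)
```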