2023-05-10 20:29:54

Python regex positive lookahead cannot split correctly

Question

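[^a-z]+ is not lazy here. The + quantifier is greedy, and putting it after a negated character class does not change that. The unexpected splits have two other causes. First, [^a-z] matches any character that is not a lowercase letter, which includes \n, digits, spaces and punctuation, so the lookahead can run across several lines. Second, re.split tests the lookahead independently at every \n, so every line break that is followed only by non-lowercase characters, then a line break, then a lowercase letter or digit becomes a split point. That covers each blank line before a title and each line break inside a multi-line title.

A lookahead alone cannot tell the first line of a title from its continuation lines, because that depends on the line before it. If sections are always separated by a blank line, adding a lookbehind is enough:

re.split(r'(?<=\n)\n(?=[A-Z0-9 ]+\n)', text)

This regex splits at a line break (\n) that directly follows another line break and is followed by a line made up of uppercase letters, digits and spaces. It does not split where a title follows the body with no blank line in between. For that case, walk through the text line by line and track whether the previous line was a title.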
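A quick way to check the greedy-versus-lazy question is to list every newline where the lookahead from the question succeeds, together with what [^a-z]+ matched there. The sketch below reuses the sample text from the question; the names probe, hits and parts are only illustrative, and the capturing group is added purely for inspection.

```python
import re

text = """
Lorem ipsum

THIS SECTION IS A SHORT STORY
1 Hello world
2 Bye bye
Side comment


NEXT SECTION SPANS 200
YEARS AND MANY COUNTRIES!

3 Joe Bloggs attended a NATO summit
4 John Doe heard...
THIS SECTION HAS NO
LINE BREAK / SPACE FROM
THE PREVIOUS ONE

5 Alice thought...
6 Bob visited...
""".strip()

# Same pattern as in the question, with a capturing group added around
# [^a-z]+ so we can see what that part of the lookahead matched.
probe = re.compile(r"\n(?=([^a-z]+)\n+[a-z\d])")

# One entry per newline where the lookahead succeeds: (position, captured text).
hits = [(m.start(), m.group(1)) for m in probe.finditer(text)]
for pos, captured in hits:
    print(pos, repr(captured))

# re.split adds captured groups to its result, so split with the
# original pattern (no capturing group) to reproduce the question's output.
parts = re.split(r"\n(?=[^a-z]+\n+[a-z\d])", text)
print(len(hits), "split points ->", len(parts), "pieces")
```

Each captured string is as long as the rest of the lookahead allows, which is what a greedy quantifier does. Several of them start with or contain \n, which shows the negated class running across line breaks. The empty pieces in the result come from consecutive newlines that each satisfy the lookahead on their own.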

I have text consisting of sections. In each section:

The title is in uppercase and may span multiple lines
The body may have acronyms, so we cannot assume that uppercase words mark the start of each section

There may be zero or multiple line breaks between sections.

Example

import re

text = """
Lorem ipsum

THIS SECTION IS A SHORT STORY
1 Hello world
2 Bye bye
Side comment


NEXT SECTION SPANS 200
YEARS AND MANY COUNTRIES!

3 Joe Bloggs attended a NATO summit
4 John Doe heard...
THIS SECTION HAS NO
LINE BREAK / SPACE FROM
THE PREVIOUS ONE

5 Alice thought...
6 Bob visited...
""".strip()

re.split(r"\n(?=[^a-z]+\n+[a-z\d])", text)

I expected it to split the text by sections like this:

["Lorem ipsum\n",
 "THIS SECTION IS A SHORT STORY\n1 Hello world\n2 Bye bye\nSide comment\n\n",
 "NEXT SECTION SPANS 200\nYEARS AND MANY COUNTRIES!\n\n3 Joe Bloggs attended a NATO summit\n4 John Doe heard...",
 "THIS SECTION HAS NO\nLINE BREAK / SPACE FROM\nTHE PREVIOUS ONE\n\n5 Alice thought...\n6 Bob visited..."]

Instead, Python splits up each section as follows, which seems to contradict the lookahead assertion:

["Lorem ipsum",
 "",
 "THIS SECTION IS A SHORT STORY\n1 Hello world\n2 Bye bye\nSide comment",
 "",
 "",
 "NEXT SECTION SPANS 200",
 "YEARS AND MANY COUNTRIES!\n\n3 Joe Bloggs attended a NATO summit\n4 John Doe heard...",
 "THIS SECTION HAS NO",
 "LINE BREAK / SPACE FROM",
 "THE PREVIOUS ONE\n\n5 Alice thought...\n6 Bob visited..."]

Questions

Why does [^a-z]+ behave like a lazy match instead of greedy match?

What's the correct solution?

Answer 1

Score: 1

Updated example

We can add a lookbehind to match a double \n (or split on \n\n if you don't need the trailing \n), and include digits in the set of characters.

re.split(r"(?<=\n)\n(?=[A-Z0-9 ]+\n)", text)

Or (?<=\n)\n(?= *[A-Z][A-Z0-9 ]*\n) to force at least one initial uppercase letter.

Output (this appears to come from an earlier version of the sample text, so it does not match the question line for line):

['Lorem ipsum\n',
 'THIS SECTION IS A SHORT STORY\n1 Hello world\n2 Bye bye\n',
 'THIS SECTION SPANS 200\nYEARS AND MANY COUNTRIES\n3 Joe Bloggs saw...\n4 John Doe heard...\n',
 'THIS SECTION IS ALSO A\nLONG STORY ABOUT EVERYTHING\nSINCE 1669\n\n5 Alice thought...\n6 Bob visited...']
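To see what the lookbehind version does and does not handle, here is a small sketch run on a shortened sample. The sample string and the variable names are made up for illustration; only the pattern comes from the answer.

```python
import re

# Hypothetical sample: two titles preceded by blank lines, and a third
# title that follows the body with no blank line in between.
sample = (
    "Lorem ipsum\n"
    "\n"
    "FIRST TITLE\n"
    "1 Hello world\n"
    "\n"
    "\n"
    "SECOND TITLE 200\n"
    "SECOND LINE\n"
    "\n"
    "2 Joe attended a NATO summit\n"
    "THIRD TITLE\n"
    "\n"
    "3 Alice thought..."
)

# Split at a newline that follows another newline (i.e. ends a blank line)
# and is followed by a line of uppercase letters, digits and spaces.
pattern = r"(?<=\n)\n(?=[A-Z0-9 ]+\n)"
pieces = re.split(pattern, sample)

for piece in pieces:
    print(repr(piece))
```

The multi-line title stays in one piece and the acronym in the body does not trigger a split, but THIRD TITLE stays attached to the previous section because no blank line precedes it. That is the case the loop-based approach is meant to cover.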

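Using a loop

import re

out = ['']
prev_header = True  # start as True so a title on the first line does not open a new section
for line in text.splitlines():
    if line:  # skip blank lines
        # a title line is any line without lowercase letters
        header = bool(re.fullmatch('[^a-z]+', line))
        if header and not prev_header:
            # first title line after a body line: start a new section
            out.append(line+'\n')
        else:
            # body line, or continuation of a multi-line title
            out[-1] += line+'\n'
        prev_header = header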

Output:

['Lorem ipsum\n',
 'THIS SECTION IS A SHORT STORY\n1 Hello world\n2 Bye bye\nSide comment\n',
 'NEXT SECTION SPANS 200\nYEARS AND MANY COUNTRIES!\n3 Joe Bloggs attended a NATO summit\n4 John Doe heard...\n',
 'THIS SECTION HAS NO\nLINE BREAK / SPACE FROM\nTHE PREVIOUS ONE\n5 Alice thought...\n6 Bob visited...\n']
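The same loop can be wrapped in a function so it is easier to reuse and test. This is a minimal sketch under the same assumption as the loop, namely that a title line is any non-empty line without lowercase letters. The function name split_sections and the sample string are hypothetical.

```python
import re

def split_sections(text):
    """Group lines into sections.

    A new section starts at the first title line that follows a body line.
    Blank lines are dropped, and every kept line ends with a newline.
    """
    sections = ['']
    prev_header = True  # so a title on the very first line does not open an empty section
    for line in text.splitlines():
        if not line:
            continue  # blank lines carry no information for the grouping
        header = re.fullmatch(r'[^a-z]+', line) is not None
        if header and not prev_header:
            sections.append(line + '\n')
        else:
            sections[-1] += line + '\n'
        prev_header = header
    return sections

# Hypothetical sample with a multi-line title, an acronym in the body,
# and a title that follows the body without a blank line.
sample = (
    "Intro text\n"
    "\n"
    "FIRST TITLE\n"
    "ON TWO LINES\n"
    "\n"
    "1 body mentions NATO\n"
    "SECOND TITLE\n"
    "2 more body"
)
result = split_sections(sample)
print(result)
```

One limitation to keep in mind: a body line that contains no lowercase letters at all (for example a line consisting only of a number and an acronym) is treated as a title and starts a new section.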
