2023-03-31 19:25:11

Python Regex to extract text between numbers

Question

I'd like to extract the text between digits. For example, if I have text such as the following:

1964 ORDINARY shares
EXECUTORS OF JOANNA C RICHARDSON
100 ORDINARY shares
TG MARTIN
C MARTIN
7500 ORDINARY shares
ARCO LIMITED

I want to produce a list of 3 elements, where each element starts at a number and runs up to, but does not include, the next number. The final element has no following number, so it runs to the end of the text:

[
'1964 ORDINARY shares \nEXECUTORS OF JOANNA C RICHARDSON',
'100 ORDINARY shares \nTG MARTIN\nC MARTIN\n',
'7500 ORDINARY shares\nARCO LIMITED'
]

I tried doing this:

regex = r'\d(.+?)\d'
re.findall(regex, a, re.DOTALL)

but it returned:

['9',
 ' ORDINARY shares\nEXECUTORS OF JOANNA C RICHARDSON\n',
 '0 ORDINARY shares\nTG MARTIN\nC MARTIN\n',
 '0']

Answer 1

Score: 1

You can use the code below to achieve this.

import re

text = """1964 ORDINARY shares
EXECUTORS OF JOANNA C RICHARDSON
100 ORDINARY shares
TG MARTIN
C MARTIN
7500 ORDINARY shares
ARCO LIMITED"""

# Use regex to find the text between digits

pattern = r'\d+.*?(?=\d|$)'
matches = re.findall(pattern, text, flags=re.DOTALL)

print(matches)
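As a rough check of what this returns, the sketch below reruns the same pattern and then strips trailing whitespace from each match. The cleaned list is only an illustrative addition, assuming the trailing newlines are unwanted; it is not part of the original answer. Because (?=\d|$) is a lookahead, the next number is not consumed and is free to start the following match.

```python
import re

text = """1964 ORDINARY shares
EXECUTORS OF JOANNA C RICHARDSON
100 ORDINARY shares
TG MARTIN
C MARTIN
7500 ORDINARY shares
ARCO LIMITED"""

# \d+ takes the whole leading number; .*? then expands lazily until the
# lookahead sees the next digit or the end of the string.
pattern = r'\d+.*?(?=\d|$)'
matches = re.findall(pattern, text, flags=re.DOTALL)

# Assumption: trailing newlines are not wanted, so strip them from each match.
cleaned = [m.rstrip() for m in matches]
print(cleaned)
```

Note that the lookahead stops at any digit, so this sketch relies on digits appearing only in the leading numbers, as they do in the sample text.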

Answer 2

Score: 1

The pattern \d(.+?)\d matches at least 3 characters: a digit, then one or more characters captured in group 1 (since (.+?) must match at least 1 character), then another digit.

You get those results because you are using a capture group with re.findall, which returns the value of the capture group.

So for example in 1964 you match 196, where 9 is captured in group 1, and that is the first value in your result.
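A small sketch of this behaviour, using the sample text from the question: re.findall returns only group 1, while re.finditer with group(0) exposes the full matches, including the digit consumed on each side. The names groups and full are illustrative.

```python
import re

a = """1964 ORDINARY shares
EXECUTORS OF JOANNA C RICHARDSON
100 ORDINARY shares
TG MARTIN
C MARTIN
7500 ORDINARY shares
ARCO LIMITED"""

regex = r'\d(.+?)\d'

# With one capture group, findall returns the group, not the whole match.
groups = re.findall(regex, a, re.DOTALL)

# group(0) is the whole match, including the outer digits.
full = [m.group(0) for m in re.finditer(regex, a, re.DOTALL)]

print(groups[0], full[0])  # the captured '9' versus the full match '196'
```

The closing digit of each match is consumed, so it cannot begin the next match, which is why the numbers come out cut into pieces.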

A downvoted and since-removed answer by markalex and a comment by Michael Butscher point to the key idea: use a pattern that needs neither re.DOTALL nor a non-greedy quantifier.

\b\d+\b\D*

Explanation

\b\d+\b Match 1+ digits between word boundaries, so digits that are part of a longer word are not matched.
\D* Match zero or more characters other than digits, including newlines.
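Applied to the sample text from the question, the pattern can be sketched as follows. No flags are needed because \D already matches newlines; the final strip step is an assumption about the desired output, not part of the answer.

```python
import re

text = """1964 ORDINARY shares
EXECUTORS OF JOANNA C RICHARDSON
100 ORDINARY shares
TG MARTIN
C MARTIN
7500 ORDINARY shares
ARCO LIMITED"""

# There is no capture group, so findall returns the whole match each time.
matches = re.findall(r'\b\d+\b\D*', text)

# Assumption: trailing newlines are not wanted in the final list.
result = [m.rstrip() for m in matches]
print(result)
```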

If each match should begin at the start of a line and the number should be followed by a whitespace character, you might also consider using the ^ anchor together with re.M (multiline).

^\d+\s\D*
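A sketch of the anchored variant on the same sample text, followed by a comparison on a hypothetical input (the UNIT 2 HOLDINGS line is invented for illustration) where a digit appears in the middle of a line:

```python
import re

text = """1964 ORDINARY shares
EXECUTORS OF JOANNA C RICHARDSON
100 ORDINARY shares
TG MARTIN
C MARTIN
7500 ORDINARY shares
ARCO LIMITED"""

# re.M lets ^ match at the start of every line, not only the string start.
anchored = re.findall(r'^\d+\s\D*', text, flags=re.M)

# Hypothetical input with a digit inside a holder name.
tricky = "100 ORDINARY shares\nUNIT 2 HOLDINGS\n7500 ORDINARY shares\nARCO LIMITED"
by_boundary = re.findall(r'\b\d+\b\D*', tricky)
by_anchor = re.findall(r'^\d+\s\D*', tricky, flags=re.M)

print(anchored)
print(by_boundary)  # the mid-line 2 starts its own match
print(by_anchor)    # the mid-line 2 starts no match, but \D* still stops there
```

In both patterns \D* ends at the first digit it meets, so they fit best when digits appear only in the leading numbers, as in the question's data.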
