Python Regex to extract text between numbers

Question

I'd like to extract the text between digits. For example, if I have text such as the following:

1964 ORDINARY shares
EXECUTORS OF JOANNA C RICHARDSON
100 ORDINARY shares
TG MARTIN
C MARTIN
7500 ORDINARY shares
ARCO LIMITED

I want to produce a list of 3 elements, where each element runs from one number up to (but not including) the next number. The final element has no end number, so it runs to the end of the text:

[
'1964 ORDINARY shares \nEXECUTORS OF JOANNA C RICHARDSON',
'100 ORDINARY shares \nTG MARTIN\nC MARTIN\n',
'7500 ORDINARY shares\nARCO LIMITED'
]

I tried doing this:

import re
# a holds the text shown above
regex = r'\d(.+?)\d'
re.findall(regex, a, re.DOTALL)

but it returned:

['9',
 ' ORDINARY shares\nEXECUTORS OF JOANNA C RICHARDSON\n',
 '0 ORDINARY shares\nTG MARTIN\nC MARTIN\n',
 '0']

Answer 1

Score: 1

You can use the code below to achieve this.

import re

text = """1964 ORDINARY shares
EXECUTORS OF JOANNA C RICHARDSON
100 ORDINARY shares 
TG MARTIN
C MARTIN
7500 ORDINARY shares 
ARCO LIMITED"""

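# Match a run of digits, then lazily consume text (re.DOTALL lets . cross newlines)
# until the next digit or the end of the text; the lookahead does not consume that digit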
pattern = r'\d+.*?(?=\d|$)'
matches = re.findall(pattern, text, flags=re.DOTALL)

print(matches)
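
The matches from this pattern keep any trailing newline, while the list in the question has none on its first and last elements. A minimal sketch of trimming them with str.strip; the names raw and cleaned are only illustrative, and the sample text here is written without trailing spaces:

```python
import re

text = """1964 ORDINARY shares
EXECUTORS OF JOANNA C RICHARDSON
100 ORDINARY shares
TG MARTIN
C MARTIN
7500 ORDINARY shares
ARCO LIMITED"""

# Same pattern as above: digits, then lazily everything up to the next digit or the end
raw = re.findall(r'\d+.*?(?=\d|$)', text, flags=re.DOTALL)

# Drop leading/trailing whitespace (including the trailing newline) from each match
cleaned = [m.strip() for m in raw]

print(cleaned)
```

Note that strip() only touches the ends of each element, so the newlines inside an element are preserved.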

Answer 2

Score: 1

The pattern \d(.+?)\d matches at least 3 characters: a digit, then at least 1 character captured in group 1 by (.+?), then another digit.

You get those results because you are using a capture group with re.findall, which returns the value of the capture group.

So for example in 1964 you match 196, where 9 is captured in group 1, and that is the first value in your result.
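
To see the capture-group behaviour in isolation, here is a small sketch on a shortened sample string (the string and variable names are just for illustration). It compares the original pattern with a variant that uses a non-capturing group, so that re.findall returns the whole match instead:

```python
import re

sample = "1964 ORDINARY"

# With a capture group, findall returns only what group 1 captured
with_group = re.findall(r'\d(.+?)\d', sample, re.DOTALL)

# With a non-capturing group (?:...), findall returns the full match
without_group = re.findall(r'\d(?:.+?)\d', sample, re.DOTALL)

print(with_group)     # ['9']
print(without_group)  # ['196']
```

Either way the match itself is 196, which is why the leading digits of each number go missing.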

A downvoted and since-removed answer by markalex and a comment by Michael Butscher both point to the key idea: a pattern that needs neither re.DOTALL nor a non-greedy quantifier.

\b\d+\b\D*

Explanation

  • \b\d+\b Match 1+ digits between word boundaries to prevent a partial word match.
  • \D* Match optional characters other than digits, including newlines.
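
As a quick check, this is roughly how the pattern behaves on the sample text from the question (a sketch only; the text is written without trailing spaces, and the variable names are arbitrary):

```python
import re

text = """1964 ORDINARY shares
EXECUTORS OF JOANNA C RICHARDSON
100 ORDINARY shares
TG MARTIN
C MARTIN
7500 ORDINARY shares
ARCO LIMITED"""

# No flags needed: \D already matches newlines, and the greedy \D* stops at the next digit
matches = re.findall(r'\b\d+\b\D*', text)

for m in matches:
    print(repr(m))
```

Each match keeps its trailing newline, so call strip() on the elements if you want them without it.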

If each match should begin at the start of a line and the number should be followed by a whitespace character, you might also consider using an anchor with re.M for multiline.

^\d+\s\D*
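
A sketch of the anchored variant, with one made-up extra string (FLAT 2 LIMITED is a hypothetical holder name, not from the question) to show where the two patterns differ. With the anchor, a number in the middle of a line does not start a new match:

```python
import re

text = """1964 ORDINARY shares
EXECUTORS OF JOANNA C RICHARDSON
100 ORDINARY shares
TG MARTIN
C MARTIN
7500 ORDINARY shares
ARCO LIMITED"""

# re.M makes ^ match at the start of every line, not only the start of the string
anchored = re.findall(r'^\d+\s\D*', text, flags=re.M)
print(anchored)

# Hypothetical input with a digit inside a name
extra = "100 ORDINARY shares\nFLAT 2 LIMITED\n7500 ORDINARY shares"

anchored_extra = re.findall(r'^\d+\s\D*', extra, flags=re.M)
boundary_extra = re.findall(r'\b\d+\b\D*', extra)

print(anchored_extra)  # the "2 LIMITED" part does not start a match
print(boundary_extra)  # the "2" starts its own match
```

In both cases \D* still stops at the first digit it meets, so a digit inside a name cuts the preceding element short. If that can occur in your data, these patterns need adjusting.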


