Extract digits from a string within a word
Question
I want a regular expression that returns only the digits that appear within a word, but I can only find expressions that return all digits in a string.
I've used this example:
text = 'I need this number inside my wor5d, but also this word3 and this 4word, but not this 1 and not this 555.'
The following code returns all digits, but I am only interested in ['5', '3', '4']:
import re
print(re.findall(r'\d+', text))
Any suggestions?
Answer 1
Score: 1
You can use
re.findall(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])', text)
This regex extracts every run of one or more digits that is immediately preceded or followed by an ASCII letter.
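As a quick check, the pattern can be run against the sample sentence from the question. This is only a minimal sketch: the helper name `digits_in_words` and the extra `ab12cd` input are illustrative and not part of the original answer.

```python
import re

# ASCII-only pattern from the answer: a digit run preceded OR followed by a letter
PATTERN = re.compile(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])')


def digits_in_words(text):
    """Return digit runs that touch an ASCII letter on at least one side."""
    return PATTERN.findall(text)


text = ('I need this number inside my wor5d, but also this word3 '
        'and this 4word, but not this 1 and not this 555.')

print(digits_in_words(text))      # ['5', '3', '4']
print(digits_in_words('ab12cd'))  # ['12'] (hypothetical input with a multi-digit run)
```

Standalone numbers such as `1` and `555` are skipped because neither the lookbehind nor the lookahead finds a letter next to them.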
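A fully Unicode-aware version for Python's re module would look like
(?<=[^\W\d_])\d+|\d+(?=[^\W\d_])
where [^\W\d_] matches any Unicode letter (more precisely, any word character that is neither a decimal digit nor an underscore).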
See the regex demo for reference.
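The difference between the two patterns only shows up when the neighbouring letter is not ASCII. A small sketch, assuming a made-up input string (`café5` and `7über` do not come from the original question):

```python
import re

ASCII_RE = re.compile(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])')
UNICODE_RE = re.compile(r'(?<=[^\W\d_])\d+|\d+(?=[^\W\d_])')

# Hypothetical input: the digits sit next to non-ASCII letters
sample = 'café5 and 7über, but not 42'

ascii_hits = ASCII_RE.findall(sample)
unicode_hits = UNICODE_RE.findall(sample)

print(ascii_hits)    # []
print(unicode_hits)  # ['5', '7']
```

If the input is known to be plain ASCII, the two patterns should behave the same, so the simpler character class is enough.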
Answer 2
Score: -1
**An approach with `str.translate`**, without using regex or the `re` module:
```python3
from string import ascii_letters

# Translation table that deletes every ASCII letter
delete_dict = {sp_character: '' for sp_character in ascii_letters}
table = str.maketrans(delete_dict)
text = 'I 77! need 1:5 this number inside my wor5d, but also this word3 and this 4word, but not this 1 and not this 555.'
# Skip purely numeric tokens, strip the letters, keep what is left only if it is numeric
print([res for s in text.rstrip('.').split()
       if not (s2 := s.rstrip(',')).isnumeric() and (res := s2.translate(table)) and res.isnumeric()])
```
Output:
['5', '3', '4']
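The core of this approach is the translation table, which maps every ASCII letter to the empty string so that `translate` deletes it. Below is a minimal sketch of that single step on individual tokens. The helper name `strip_letters` is illustrative, and the three-argument form of `str.maketrans` is an equivalent alternative to the dict-based table, not what the answer itself uses.

```python
from string import ascii_letters

# The third argument of str.maketrans lists the characters to delete
table = str.maketrans('', '', ascii_letters)


def strip_letters(token):
    """Delete ASCII letters from a token, leaving every other character."""
    return token.translate(table)


print(strip_letters('wor5d'))       # 5
print(strip_letters('4word'))       # 4
print(strip_letters('77!'))         # 77!  (not purely numeric, so the filter drops it)
print(repr(strip_letters('need')))  # ''   (empty, so the filter drops it as falsy)
```

The `isnumeric()` checks in the list comprehension then decide which of these leftovers count as a digit inside a word.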
Performance
I was curious, so I ran some benchmarks to compare this against the other approaches. On this sample input, `str.translate` comes out faster than even the precompiled regex.
Here is my benchmark code, using `timeit`:
```python3
import re
from string import ascii_letters
from timeit import timeit

_NUM_RE = re.compile(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])')
delete_dict = {sp_character: '' for sp_character in ascii_letters}
_TABLE = str.maketrans(delete_dict)
text = 'I need this number inside my wor5d, but also this word3 and this 4word, but not this 1 and not this 555.'


def main():
    n = 100_000

    # Pattern passed as a string on each call; the outer string is raw so \d is not treated as an escape
    print('regex: ', timeit(r"re.findall(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])', text)",
                            globals=globals(), number=n))

    # Pattern compiled once, up front
    print('regex (opt): ', (timeit("_NUM_RE.findall(text)",
                                   globals=globals(), number=n)))

    # Character-by-character scan that looks at each character's neighbours
    print('iter_char: ', timeit("""
k=set()
for x in range(1,len(text)-1):
    if text[x-1].isdigit() and text[x].isalpha():
        k.add(text[x-1])
    if text[x].isdigit() and text[x+1].isalpha():
        k.add(text[x])
    if text[x-1].isalpha() and text[x].isdigit() and text[x+1].isalpha():
        k.add(text[x])
    if text[x-1].isalpha() and text[x].isdigit():
        k.add(text[x])
""", globals=globals(), number=n))

    # Token-based approach from this answer
    print('str.translate: ', timeit("""
[
    res for s in text.rstrip('.').split()
    if not (s2 := s.rstrip(',')).isnumeric() and (res := s2.translate(_TABLE)) and res.isnumeric()
]
""", globals=globals(), number=n))


if __name__ == '__main__':
    main()
```
Results, in seconds for 100,000 runs each (Mac OS X - M1):
regex: 0.5315765410050517
regex (opt): 0.5069837079936406
iter_char: 2.5037198749923846
str.translate: 0.37348733299586456
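These timings compare speed only. The two approaches are not interchangeable on every input: the regex works per digit run, while the `str.translate` version works per whitespace-separated token and strips only trailing commas. The sketch below illustrates this with hypothetical inputs. The function names and test strings are illustrative and not from the original answer, and the walrus operator requires Python 3.8 or later.

```python
import re
from string import ascii_letters

_NUM_RE = re.compile(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])')
_TABLE = str.maketrans({ch: '' for ch in ascii_letters})


def via_regex(text):
    return _NUM_RE.findall(text)


def via_translate(text):
    # Same comprehension as in the answer, wrapped in a function
    return [res for s in text.rstrip('.').split()
            if not (s2 := s.rstrip(',')).isnumeric()
            and (res := s2.translate(_TABLE)) and res.isnumeric()]


for sample in ('wor5d word3 4word', 'x1y2', '(word3)'):
    print(repr(sample), via_regex(sample), via_translate(sample))
# 'wor5d word3 4word' ['5', '3', '4'] ['5', '3', '4']
# 'x1y2' ['1', '2'] ['12']
# '(word3)' ['3'] []
```

Which behaviour is right depends on what counts as a word in the real data, so it is worth checking both on representative input before choosing on speed alone.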