Extract digits from a string within a word

Question

I want a regular expression that returns only the digits that appear within a word, but I can only find expressions that return all the digits in a string.

I've used this example:
text = 'I need this number inside my wor5d, but also this word3 and this 4word, but not this 1 and not this 555.'

The following code returns all digits (including 1 and 555), but I am only interested in ['5', '3', '4']:

import re
print(re.findall(r'\d+', text))

Any suggestions?

Answer 1

Score: 1

You can use

re.findall(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])', text)

This regex extracts every run of one or more digits that is immediately preceded or immediately followed by an ASCII letter.
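As a quick check, the pattern can be run against the sample sentence from the question. This is only a minimal sketch; the names `DIGITS_IN_WORD` and `digits_in_words` are made up for illustration and are not part of the answer.

```python
import re

# Hypothetical names; the pattern itself is the one given in this answer.
DIGITS_IN_WORD = re.compile(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])')


def digits_in_words(text):
    """Return digit runs that touch an ASCII letter on at least one side."""
    return DIGITS_IN_WORD.findall(text)


text = ('I need this number inside my wor5d, but also this word3 '
        'and this 4word, but not this 1 and not this 555.')

print(digits_in_words(text))      # ['5', '3', '4']
print(digits_in_words('ab12cd'))  # ['12'] (a multi-digit run is one match)
```

The lookbehind and lookahead only test the neighbouring character without consuming it, so the returned matches contain digits only.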

A fully Unicode-aware version for Python's re module would look like

(?<=[^\W\d_])\d+|\d+(?=[^\W\d_])

where [^\W\d_] matches any Unicode letter.
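To see where the two patterns differ, here is a small sketch on a made-up sample containing non-ASCII letters. The sample string and the constant names are assumptions for illustration only.

```python
import re

ASCII_ONLY = re.compile(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])')
# [^\W\d_] = a word character that is neither a digit nor an underscore.
UNICODE_AWARE = re.compile(r'(?<=[^\W\d_])\d+|\d+(?=[^\W\d_])')

# Made-up sample: Cyrillic and Greek letters next to digits.
sample = 'слово5, λ3 and 42, but not under_7'

print(ASCII_ONLY.findall(sample))     # []
print(UNICODE_AWARE.findall(sample))  # ['5', '3']
```

Neither pattern treats the underscore as a letter, so the 7 in under_7 is skipped by both.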

See the regex demo for reference.

Answer 2

Score: -1

**An approach with `str.translate`**, without using regular expressions or the `re` module:

```python3
from string import ascii_letters

# Translation table that deletes every ASCII letter.
delete_dict = {sp_character: '' for sp_character in ascii_letters}
table = str.maketrans(delete_dict)

text = 'I 77! need 1:5 this number inside my wor5d, but also this word3 and this 4word, but not this 1 and not this 555.'

# Skip tokens that are purely numeric, strip the letters from the rest,
# and keep only what is left if it consists of digits alone.
print([res for s in text.rstrip('.').split()
       if not (s2 := s.rstrip(',')).isnumeric() and (res := s2.translate(table)) and res.isnumeric()])
```

Output:

['5', '3', '4']
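One behavioural difference from the regex answer is worth knowing: because the letters are deleted before the check, digit groups separated by letters inside one token end up concatenated. A small sketch on a made-up input, where `digits_via_translate` is a hypothetical wrapper around the comprehension above:

```python
import re
from string import ascii_letters

# Translation table that deletes every ASCII letter.
TABLE = str.maketrans({ch: '' for ch in ascii_letters})


def digits_via_translate(text):
    """Hypothetical wrapper around the comprehension from this answer."""
    result = []
    for token in text.rstrip('.').split():
        word = token.rstrip(',')
        if word.isnumeric():
            continue  # a bare number such as "1" or "555"
        stripped = word.translate(TABLE)
        if stripped and stripped.isnumeric():
            result.append(stripped)
    return result


sample = 'a1b2 and word3'
print(digits_via_translate(sample))  # ['12', '3']
print(re.findall(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])', sample))  # ['1', '2', '3']
```

Which result is preferable depends on whether digits split by letters should count as one number or several.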

Performance

I was curious, so I ran some benchmarks to compare this against the other approaches. In this test, str.translate comes out faster than even the regex implementations.

Here is my benchmark code with timeit:

import re
from string import ascii_letters
from timeit import timeit


_NUM_RE = re.compile(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])')

delete_dict = {sp_character: '' for sp_character in ascii_letters}
_TABLE = str.maketrans(delete_dict)

text = 'I need this number inside my wor5d, but also this word3 and this 4word, but not this 1 and not this 555.'


def main():
    n = 100_000

    # Raw outer string so that \d is not treated as a string escape.
    print('regex:         ', timeit(r"re.findall(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])', text)",
                 globals=globals(), number=n))

    # Same pattern, compiled once up front.
    print('regex (opt):   ', timeit("_NUM_RE.findall(text)",
                 globals=globals(), number=n))

    # Character-by-character scan that collects single digits into a set.
    print('iter_char:     ', timeit("""
k=set()
for x in range(1,len(text)-1):
    if text[x-1].isdigit() and text[x].isalpha():
        k.add(text[x-1])
    if text[x].isdigit() and text[x+1].isalpha():
        k.add(text[x])
    if text[x-1].isalpha() and text[x].isdigit() and text[x+1].isalpha():
        k.add(text[x])
    if text[x-1].isalpha() and text[x].isdigit():
        k.add(text[x])
    """, globals=globals(), number=n))

    print('str.translate: ', timeit("""
[
    res for s in text.rstrip('.').split()
    if not (s2 := s.rstrip(',')).isnumeric() and (res := s2.translate(_TABLE)) and res.isnumeric()
]
    """, globals=globals(), number=n))


if __name__ == '__main__':
    main()

Results (macOS on M1, total seconds for 100,000 runs each):

regex:          0.5315765410050517
regex (opt):    0.5069837079936406
iter_char:      2.5037198749923846
str.translate:  0.37348733299586456
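Single timeit runs like these can vary between runs and machines. A sketch of a somewhat more stable measurement, assuming the same two candidates, uses `timeit.repeat` and takes the minimum over several rounds. The helper `best_of` and the function names are made up for illustration, and no timings are claimed here.

```python
import re
from string import ascii_letters
from timeit import repeat

NUM_RE = re.compile(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])')
TABLE = str.maketrans({ch: '' for ch in ascii_letters})

text = ('I need this number inside my wor5d, but also this word3 '
        'and this 4word, but not this 1 and not this 555.')


def with_regex():
    return NUM_RE.findall(text)


def with_translate():
    return [res for s in text.rstrip('.').split()
            if not (s2 := s.rstrip(',')).isnumeric()
            and (res := s2.translate(TABLE)) and res.isnumeric()]


def best_of(func, number=10_000, rounds=5):
    """Hypothetical helper: best total time in seconds over several rounds."""
    return min(repeat(func, number=number, repeat=rounds))


if __name__ == '__main__':
    for name, func in [('regex (compiled)', with_regex),
                       ('str.translate', with_translate)]:
        print(f'{name:18}', best_of(func))
```

Taking the minimum rather than the mean reduces the influence of background load on the reported figure.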

huangapple
  • Published on March 15, 2023, 21:10:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75745192.html