Extract digits from a string within a word

Question

I want a regular expression that returns only the digits that occur inside a word, but I can only find expressions that return all digits in a string.

I've used this example:
text = 'I need this number inside my wor5d, but also this word3 and this 4word, but not this 1 and not this 555.'

The following code returns all digits, but I am only interested in ['5', '3', '4']

import re
print(re.findall(r'\d+', text))

Any suggestions?

Answer 1

Score: 1

You can use

    re.findall(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])', text)

This regex extracts every chunk of one or more digits that is immediately preceded or immediately followed by an ASCII letter.

A fully Unicode-aware version for Python re would look like

    (?<=[^\W\d_])\d+|\d+(?=[^\W\d_])

where [^\W\d_] matches any Unicode letter (strictly, any word character that is neither a decimal digit nor an underscore).

See the regex demo for reference.
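
As a quick check of the two patterns above, here is a minimal sketch that runs both against the sample sentence from the question and against a string containing non-ASCII letters. The names `ASCII_PATTERN` and `UNICODE_PATTERN` and the accented sample string are illustrative assumptions, not part of the original answer.

```python
import re

# Digit runs immediately preceded or followed by an ASCII letter.
ASCII_PATTERN = re.compile(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])')
# Same idea, with [^\W\d_] standing in for "any Unicode letter".
UNICODE_PATTERN = re.compile(r'(?<=[^\W\d_])\d+|\d+(?=[^\W\d_])')

text = ('I need this number inside my wor5d, but also this word3 '
        'and this 4word, but not this 1 and not this 555.')

ascii_hits = ASCII_PATTERN.findall(text)      # ['5', '3', '4']
unicode_hits = UNICODE_PATTERN.findall(text)  # ['5', '3', '4']

# Hypothetical sample: digits next to the accented letter U+00E9.
accented = 'caf\u00e97 and 8\u00e9t\u00e9'
ascii_accented = ASCII_PATTERN.findall(accented)      # []
unicode_accented = UNICODE_PATTERN.findall(accented)  # ['7', '8']

print(ascii_hits, unicode_hits, ascii_accented, unicode_accented)
```

The ASCII pattern finds nothing in the accented sample because the letter next to each digit falls outside `[a-zA-Z]`; the Unicode variant picks up both digits.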

Answer 2

Score: -1

An approach with `str.translate`, without using regex or the `re` module:

```python3
from string import ascii_letters

# Translation table that deletes every ASCII letter.
delete_dict = {sp_character: '' for sp_character in ascii_letters}
table = str.maketrans(delete_dict)

text = 'I 77! need 1:5 this number inside my wor5d, but also this word3 and this 4word, but not this 1 and not this 555.'

# Keep a token only if it is not purely numeric to begin with,
# but becomes purely numeric once its letters are removed.
print([res for s in text.rstrip('.').split()
       if not (s2 := s.rstrip(',')).isnumeric() and (res := s2.translate(table)) and res.isnumeric()])
```

Out:

    ['5', '3', '4']
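
One caveat worth checking before adopting this approach: because the table deletes letters and keeps everything else, digits separated by letters inside a single token are merged, and trailing punctuation other than a comma (or a final period) makes the token fail the `isnumeric()` check. The sketch below compares it with the lookaround regex from the first answer; `digits_via_translate` and the sample string are hypothetical names and inputs chosen for illustration.

```python
import re
from string import ascii_letters

TABLE = str.maketrans({ch: '' for ch in ascii_letters})
NUM_RE = re.compile(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])')


def digits_via_translate(text):
    """Hypothetical wrapper around the comprehension shown above."""
    return [res for s in text.rstrip('.').split()
            if not (s2 := s.rstrip(',')).isnumeric()
            and (res := s2.translate(TABLE))
            and res.isnumeric()]


sample = 'a1b2 word3! x9'
translate_result = digits_via_translate(sample)  # ['12', '9']
regex_result = NUM_RE.findall(sample)            # ['1', '2', '3', '9']
print(translate_result, regex_result)
```

Whether the merged `'12'` or the skipped `'word3!'` matters depends on the input; for the sentence in the question both approaches agree.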

Performance

I was curious, so I ran some benchmarks to compare this against the other approaches. On the sample sentence, str.translate came out faster than even the precompiled regex.

Here is my benchmark code with timeit:

    import re
    from string import ascii_letters
    from timeit import timeit

    # Precompiled pattern: digit runs directly preceded or followed by an ASCII letter.
    _NUM_RE = re.compile(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])')

    # Translation table that deletes every ASCII letter.
    delete_dict = {sp_character: '' for sp_character in ascii_letters}
    _TABLE = str.maketrans(delete_dict)

    text = 'I need this number inside my wor5d, but also this word3 and this 4word, but not this 1 and not this 555.'


    def main():
        n = 100_000

        # The outer string is raw so that \d reaches the regex engine unchanged.
        print('regex: ', timeit(r"re.findall(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])', text)",
                                globals=globals(), number=n))

        print('regex (opt): ', timeit("_NUM_RE.findall(text)",
                                      globals=globals(), number=n))

        # Character-by-character scan, kept exactly as it was benchmarked.
        print('iter_char: ', timeit("""
    k = set()
    for x in range(1, len(text) - 1):
        if text[x-1].isdigit() and text[x].isalpha():
            k.add(text[x-1])
        if text[x].isdigit() and text[x+1].isalpha():
            k.add(text[x])
        if text[x-1].isalpha() and text[x].isdigit() and text[x+1].isalpha():
            k.add(text[x])
        if text[x-1].isalpha() and text[x].isdigit():
            k.add(text[x])
    """, globals=globals(), number=n))

        print('str.translate: ', timeit("""
    [
        res for s in text.rstrip('.').split()
        if not (s2 := s.rstrip(',')).isnumeric() and (res := s2.translate(_TABLE)) and res.isnumeric()
    ]
    """, globals=globals(), number=n))


    if __name__ == '__main__':
        main()

Results (Mac OS X - M1), total seconds for 100,000 runs each:

    regex:          0.5315765410050517
    regex (opt):    0.5069837079936406
    iter_char:      2.5037198749923846
    str.translate:  0.37348733299586456
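
Single `timeit` runs like the ones above can be noisy. A common way to firm up such numbers is `timeit.repeat`, taking the minimum over several rounds. The following is only a sketch under that assumption: the helper `best_time`, the round count, and the reduced `number` are illustrative choices, not the settings used for the results above.

```python
import re
from timeit import repeat

NUM_RE = re.compile(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])')
text = ('I need this number inside my wor5d, but also this word3 '
        'and this 4word, but not this 1 and not this 555.')


def best_time(stmt, number=10_000, rounds=5):
    """Return the fastest total time in seconds over several rounds."""
    return min(repeat(stmt, globals=globals(), number=number, repeat=rounds))


timings = {
    'regex (compiled)': best_time('NUM_RE.findall(text)'),
    'regex (module-level)': best_time(
        r"re.findall(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])', text)"),
}

for label, seconds in timings.items():
    print(f'{label}: {seconds:.4f}')
```

Absolute values will differ by machine and Python version, so only the relative ordering is worth comparing.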

huangapple
  • Posted on March 15, 2023, 21:10:12
  • When reposting, please retain the link to this article: https://go.coder-hub.com/75745192.html