How to write a Python regex that matches strings with both words and digits, excluding digits-only strings?

Question

I want to write a regex that matches a string that may contain both words and digits and not digits only.

I used this regex [A-z+\d*], but it does not work.

Samples that should match:

expression123
123expression
exp123ression

Sample that should not match:

1235234567544

Can you help me with this one? Thank you in advance.

Answer 1

Score: 7

Lookarounds to the rescue!

^(?!\d+$)\w+$

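This uses a negative lookahead and anchors: (?!\d+$) fails the match if the whole string consists of digits, and \w+ then requires one or more word characters up to the end of the string. See a demo on regex101.com.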
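As a minimal sketch, the pattern can be applied to the samples from the question like this. The names PATTERN and is_valid are illustrative and not part of the original answer.

```python
import re

# Negative lookahead rejects digit-only strings; \w+ allows word characters only.
PATTERN = re.compile(r"^(?!\d+$)\w+$")


def is_valid(text):
    """Return True if text consists of word characters and is not digits only."""
    return PATTERN.search(text) is not None


samples = ["expression123", "123expression", "exp123ression", "1235234567544"]
matched = [s for s in samples if is_valid(s)]
print(matched)
```

Compiling the pattern once is a small convenience when many strings are checked; re.search with the raw pattern string would behave the same way.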

Note that, for these samples, you could get the same result with plain Python alone:

samples = ["expression123", "123expression", "exp123ression", "1235234567544"]

filtered = [item for item in samples if not item.isdigit()]
print(filtered)

['expression123', '123expression', 'exp123ression']

See another demo on ideone.com.

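Note that the two approaches differ for input strings like -1 or 1.0: the isdigit() filter keeps them, because they are not digit-only strings, whereas the regex rejects them, because - and . are not word characters.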
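To see how each approach treats inputs such as -1 or 1.0, a short sketch can run both checks side by side. The list tricky and the dictionary results are made up for this illustration.

```python
import re

pattern = re.compile(r"^(?!\d+$)\w+$")

# Hypothetical edge cases: a signed number, a decimal, plain digits, a mixed string.
tricky = ["-1", "1.0", "42", "abc1"]

# For each input, record (accepted by the regex, accepted by the isdigit filter).
results = {
    item: (pattern.search(item) is not None, not item.isdigit())
    for item in tricky
}

for item, (regex_ok, filter_ok) in results.items():
    print(f"{item!r}: regex={regex_ok}, isdigit filter={filter_ok}")
```

Which behaviour is preferable depends on whether signs and decimal points count as valid input in your data.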

Tests

Since some discussion arose, here's a small timing test suite for different sample sizes and expressions:

import random
import re
import string
import timeit


class RegexTester:
    # the two candidate expressions under test
    expressions_to_test = {"Cary": r"^(?=.*\D)\w+$",
                           "Jan": r"^(?!\d+$)\w+$"}

    def __init__(self, sample_size=100, word_size=10, times=100):
        self.sample_size = sample_size
        self.word_size = word_size
        self.times = times

        # generate random alphanumeric samples
        self.samples = ["".join(random.choices(string.ascii_letters + string.digits, k=self.word_size))
                        for _ in range(self.sample_size)]

        # compile the expressions in question; stored on the instance so the
        # class-level dict is not mutated when several testers are created
        self.expressions_to_test = {
            key: {"raw": expression, "compiled": re.compile(expression)}
            for key, expression in self.expressions_to_test.items()
        }

    def describe_sample(self):
        # return the samples that consist of digits only
        only_digits = [item for item in self.samples if all(char.isdigit() for char in item)]
        return only_digits

    def test_expressions(self):

        def regex_test(samples, expr):
            return [expr.search(item) for item in samples]

        for key, values in self.expressions_to_test.items():
            t = timeit.Timer(lambda: regex_test(self.samples, values["compiled"]))

            # note: the timer always runs 100 iterations; self.times is only echoed in the output
            print("{key}, Times: {times}, Result: {result}".format(key=key,
                                                                   times=self.times,
                                                                   result=t.timeit(100)))


rt = RegexTester(sample_size=10 ** 5, word_size=10, times=10 ** 4)
# rt.describe_sample()
rt.test_expressions()

For a sample size of 10^5 and a word size of 10, this gave comparable results for both expressions:

Cary, Times: 10000, Result: 6.1406331
Jan, Times: 10000, Result: 5.948537699999999

When you set the sample size to 10^4 and the word size to 10^3, the picture is the same:

Cary, Times: 10000, Result: 10.1723557
Jan, Times: 10000, Result: 9.697761900000001

You'll get significant differences when the strings consist only of digits (that is, when the samples are generated from digits alone):

Cary, Times: 10000, Result: 25.4842013
Jan, Times: 10000, Result: 17.3708319

Note that this is randomly generated text and, due to the way it is generated, the longer the strings are, the less likely they are to consist only of digits. In the end, it will depend on the actual text inputs.
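The digits-only case could be reproduced by drawing the samples from string.digits alone. The following is a rough sketch under that assumption; the helper make_samples and the small sizes are not part of the original test suite, and no timings are shown because they vary by machine.

```python
import random
import re
import string
import timeit


def make_samples(alphabet, sample_size, word_size):
    """Generate sample_size random strings of length word_size from alphabet."""
    return ["".join(random.choices(alphabet, k=word_size)) for _ in range(sample_size)]


expressions = {
    "Cary": re.compile(r"^(?=.*\D)\w+$"),
    "Jan": re.compile(r"^(?!\d+$)\w+$"),
}

# Digits only, so neither expression should match any sample.
digit_samples = make_samples(string.digits, sample_size=1000, word_size=10)

for name, expr in expressions.items():
    elapsed = timeit.timeit(lambda: [expr.search(s) for s in digit_samples], number=10)
    print(f"{name}: {elapsed:.4f}s")
```

Swapping string.digits for string.ascii_letters + string.digits gives the mixed case again, which makes it easy to compare both scenarios with the same helper.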

Answer 2

Score: 2

Another solution: simply search for any character other than a digit in your string:

import re

data = [
    'expression123',
    '123expression',
    'exp123ression',
    '1235234567544'
]

for t in data:
    m = re.search(r'\D', t)
    if m:
        print(t)

Prints:

expression123
123expression
exp123ression
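The same check can be wrapped in a small helper. One caveat worth seeing in action: \D matches any non-digit, so strings containing spaces or punctuation pass as well. The name has_non_digit and the two extra inputs are hypothetical.

```python
import re


def has_non_digit(text):
    """Return True if text contains at least one character that is not a digit."""
    return re.search(r"\D", text) is not None


data = ["expression123", "1235234567544", "12 34", "12-34"]
kept = [t for t in data if has_non_digit(t)]
print(kept)
```

If only word characters should be accepted, this check would need to be combined with a stricter test such as str.isalnum() or an anchored \w+ pattern.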

Answer 3

Score: 2

You could match the following regular expression:

^(?:\w*[a-zA-Z_]\w*)?$

This also matches empty strings. If the string must contain at least one character, it can be simplified to:

^\w*[a-zA-Z_]\w*$
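A brief sketch comparing the two variants, including the empty string and an underscore case. The names allow_empty and non_empty, and the test inputs, are chosen for this example only.

```python
import re

allow_empty = re.compile(r"^(?:\w*[a-zA-Z_]\w*)?$")
non_empty = re.compile(r"^\w*[a-zA-Z_]\w*$")

inputs = ["", "exp123ression", "_1", "1235234567544"]

# For each input, record (matches the optional-group form, matches the simplified form).
outcome = {
    text: (allow_empty.search(text) is not None, non_empty.search(text) is not None)
    for text in inputs
}

for text, flags in outcome.items():
    print(repr(text), flags)
```

Because the character class includes the underscore, a string such as _1 is accepted by both forms, which may or may not be what you want.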

Answer 4

Score: 2

Note that [A-z] matches more than [A-Za-z]: the range also covers the ASCII characters that sit between Z and a, namely [, \, ], ^, _ and `.
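The difference between the two character classes can be checked directly. This sketch tests the six ASCII characters that lie between Z and a; the variable names are illustrative.

```python
import re

# ASCII characters between 'Z' (0x5A) and 'a' (0x61): [ \ ] ^ _ `
extras = "[\\]^_`"

wide = re.compile(r"[A-z]")
strict = re.compile(r"[A-Za-z]")

matched_by_wide = [ch for ch in extras if wide.fullmatch(ch)]
matched_by_strict = [ch for ch in extras if strict.fullmatch(ch)]

print(matched_by_wide)    # all six punctuation characters
print(matched_by_strict)  # empty list
```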

If you want to check for alnum and not only digits in Python 3:

strings = [
    "expression123",
    "123expression",
    "exp123ression",
    "1235234567544",
]

for s in strings:
    if not s.isnumeric() and s.isalnum():
        print(s)

Output

expression123
123expression
exp123ression

Note that both .isnumeric() and .isalnum() are Unicode-aware.
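As a small illustration of that Unicode awareness, the same filter can be run on a few non-ASCII inputs. The sample strings are invented for this sketch and written with escapes so the code stays ASCII-only.

```python
samples = [
    "expression123",
    "\u00e9xpression\u0661\u0662\u0663",  # accented letter plus Arabic-Indic digits
    "\u0661\u0662\u0663",                 # Arabic-Indic digits only
    "\u00bd",                             # vulgar fraction one half
]

# Same condition as above: alphanumeric, but not purely numeric.
kept = [s for s in samples if not s.isnumeric() and s.isalnum()]

for s in samples:
    print(ascii(s), s.isnumeric(), s.isalnum())
```

Here the digits-only string and the fraction are both treated as numeric and filtered out, while the accented mixed string is kept.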

Answer 5

Score: 0

Try this:

import re

regex = r'^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]+$'

strings = ['expression123', '123expression', 'exp123ression', '1235234567544']

for string in strings:
    if re.match(regex, string):
        print(f'Matched: {string}')
    else:
        print(f'Not matched: {string}')

This would give:

Matched: expression123
Matched: 123expression
Matched: exp123ression
Not matched: 1235234567544
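One thing to be aware of: this pattern requires at least one letter and at least one digit, so a letters-only string is rejected too. If letters-only input should pass, the negative-lookahead pattern behaves differently, as this sketch suggests. The variable names and the extra input "expression" are assumptions for the example.

```python
import re

both_required = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]+$")
not_digits_only = re.compile(r"^(?!\d+$)\w+$")

inputs = ["expression", "exp123ression", "1235234567544"]

# For each input, record (matches both_required, matches not_digits_only).
comparison = {
    text: (both_required.match(text) is not None, not_digits_only.match(text) is not None)
    for text in inputs
}

for text, flags in comparison.items():
    print(text, flags)
```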

huangapple
  • Published on May 21, 2023, 03:41:20
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76297052.html