Can these pairs of regexes be simplified into one?
# Question
I'm trying to extract Twitter usernames from strings. My current solution looks like this:

```python
import re


def get_username(string):
    p1 = re.compile(r'twitter\.com/([a-z0-9_\.\-]+)', re.IGNORECASE)
    p2 = re.compile(r'twitter[\s\:@]+([a-z0-9_\.\-]+)', re.IGNORECASE)
    match1 = re.search(p1, string)
    match2 = re.search(p2, string)
    if match1:
        return match1.group(1)
    elif match2:
        return match2.group(1)
    else:
        return None


# Examples
get_username("Twitter: https://twitter.com/foo123")
get_username("Twitter: twitter.com/foo123")
get_username("https://twitter.com/foo123")
get_username("https://twitter.com/foo123?blah")
get_username("Twitter foo123")
get_username("Twitter @foo123")
get_username("Twitter: foo123")
get_username("Twitter: foo123 | youtube: ...")
```

I'm wondering if my two regexes can be simplified into one. My best attempt was

```python
pattern = re.compile(r'twitter(?:(?:\.com/)|(?:[\s\:@]+))([a-z0-9_\.\-]+)', re.IGNORECASE)
```

but this fails on the first example because `Twitter: https` matches *before* `twitter.com/foo123`.
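To make the failure concrete, here is a minimal sketch that runs the merged pattern against two of the examples and reports where the match starts. `first_hit` is a hypothetical helper name used only for illustration.

```python
import re

# The merged attempt from the question.
merged = re.compile(
    r'twitter(?:(?:\.com/)|(?:[\s\:@]+))([a-z0-9_\.\-]+)',
    re.IGNORECASE,
)


def first_hit(text):
    """Return (start offset, captured group) of the leftmost match, or None."""
    m = merged.search(text)
    return (m.start(), m.group(1)) if m else None


# re.search returns the leftmost match, so the leading "Twitter: " label
# wins over the URL that appears later in the string.
print(first_hit("Twitter: https://twitter.com/foo123"))  # (0, 'https')
print(first_hit("https://twitter.com/foo123"))           # (8, 'foo123')
```

The alternation itself is fine; the problem is that the label alternative succeeds at offset 0, before the engine ever reaches the URL.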
# Answer 1
**Score**: 3

Prepend the greedy quantifier `.*` to the pattern, giving `'.*twitter(?:(?:\.com/)|(?::?\s+@?))([a-z0-9_\.\-]+)'`, to skip any earlier (optional) `twitter` keywords and capture the last one:

```python
import re


def get_username(string):
    pat = re.compile(r'.*twitter(?:(?:\.com/)|(?::?\s+@?))([a-z0-9_\.\-]+)', re.IGNORECASE)
    if (match := pat.search(string)):
        print(match.group(1))
        return match.group(1)
    return None


get_username("Twitter: https://twitter.com/foo123")
get_username("Twitter: twitter.com/foo123")
get_username("https://twitter.com/foo123")
get_username("https://twitter.com/foo123?blah")
get_username("Twitter foo123")
get_username("Twitter @foo123")
get_username("Twitter: foo123")
get_username("Twitter: foo123 | youtube: ...")
get_username("Twitt11er: foo123 | youtube: ...")  # no match, prints nothing
```

Output:

```
foo123
foo123
foo123
foo123
foo123
foo123
foo123
foo123
```
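As a rough illustration of what the leading `.*` changes, the sketch below runs the same pattern body with and without the prefix. The names `BODY`, `without_prefix`, `with_prefix`, and `capture` are hypothetical, and the last input is an assumed edge case that is not among the question's examples.

```python
import re

BODY = r'twitter(?:(?:\.com/)|(?::?\s+@?))([a-z0-9_\.\-]+)'
without_prefix = re.compile(BODY, re.IGNORECASE)
with_prefix = re.compile(r'.*' + BODY, re.IGNORECASE)


def capture(pattern, text):
    """Return capture group 1 of the first match, or None."""
    m = pattern.search(text)
    return m.group(1) if m else None


text = "Twitter: twitter.com/foo123"
# Without the prefix, the leading label matches first and captures the domain.
print(capture(without_prefix, text))  # twitter.com
# With the prefix, .* consumes the line and backtracks to the last "twitter".
print(capture(with_prefix, text))     # foo123

# Assumed edge case: this alternative requires at least one whitespace
# character after the label, unlike the question's [\s\:@]+ class.
print(capture(with_prefix, "twitter:@foo123"))  # None
```

Note that `.` does not match newlines by default, so this behaviour applies within a single line unless `re.DOTALL` is added.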
# Answer 2
**Score**: 1

If the string always ends with the username, just use `(\w+)$`:

```python
import re


def get_username(string):
    if match1 := re.search(r'(\w+)$', string):
        return match1.group(1)
    return None
```
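The "always ends with the username" condition matters: two of the question's own examples do not satisfy it. A small sketch, with the function renamed to the hypothetical `get_username_tail` to keep it distinct, and one assumed extra input containing a hyphen:

```python
import re


def get_username_tail(string):
    """Capture the trailing run of word characters, as in the answer above."""
    if match := re.search(r'(\w+)$', string):
        return match.group(1)
    return None


samples = [
    "Twitter: https://twitter.com/foo123",  # ends with the username
    "https://twitter.com/foo123?blah",      # ends with a query string
    "Twitter: foo123 | youtube: ...",       # ends with punctuation
    "https://twitter.com/foo-bar",          # assumed input: \w excludes "-"
]
for s in samples:
    print(repr(s), "->", get_username_tail(s))
# 'Twitter: https://twitter.com/foo123' -> foo123
# 'https://twitter.com/foo123?blah' -> blah
# 'Twitter: foo123 | youtube: ...' -> None
# 'https://twitter.com/foo-bar' -> bar
```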
# Answer 3
**Score**: 1

I'd try a negative lookahead, `(?!https?://)`, to exclude all usernames that appear to start with `http://` or `https://`:

```
twitter(?:(?:\.com/)|(?:[\s\:@]+))(?!https?://)([a-z0-9_\.\-]+)
```
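A quick check against the question's eight examples suggests this lookahead fixes the `https://` case but still captures `twitter.com` for the second example, because the lookahead only rules out a URL scheme. The sketch below shows that, alongside one possible tweak; `tweaked` is my own assumption, not part of the answer, and simply adds `twitter\.com/` to the lookahead.

```python
import re

EXAMPLES = [
    "Twitter: https://twitter.com/foo123",
    "Twitter: twitter.com/foo123",
    "https://twitter.com/foo123",
    "https://twitter.com/foo123?blah",
    "Twitter foo123",
    "Twitter @foo123",
    "Twitter: foo123",
    "Twitter: foo123 | youtube: ...",
]

# Pattern exactly as given in the answer.
answer3 = re.compile(
    r'twitter(?:(?:\.com/)|(?:[\s\:@]+))(?!https?://)([a-z0-9_\.\-]+)',
    re.IGNORECASE,
)
# Hypothetical variant: also refuse a capture that starts with "twitter.com/".
tweaked = re.compile(
    r'twitter(?:(?:\.com/)|(?:[\s\:@]+))(?!https?://|twitter\.com/)([a-z0-9_\.\-]+)',
    re.IGNORECASE,
)


def capture(pattern, text):
    """Return capture group 1 of the first match, or None."""
    m = pattern.search(text)
    return m.group(1) if m else None


for text in EXAMPLES:
    print(f"{text!r}: answer3={capture(answer3, text)} tweaked={capture(tweaked, text)}")
```

With these inputs, `answer3` returns `twitter.com` for the second example and `foo123` for the rest, while `tweaked` returns `foo123` for all eight.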
# Answer 4
**Score**: 1

If there can be multiple matches, you can use a negative lookahead to rule out `twitter`, `http://` or `https://` to the right, and get the capture group 1 value.

```
\btwitter(?:\.com/|(?!:?\s*(?:https?://|twitter\b)):?\s+@?)([\w.-]+)
```

Explanation

- `\btwitter` Match the word `twitter`
- `(?:` Non-capture group for the alternatives
  - `\.com/` Match `.com/`
  - `|` Or
  - `(?!:?\s*(?:https?://|twitter\b))` Negative lookahead, assert not `http://`, `https://` or the word `twitter`, preceded by an optional `:` and whitespace chars, directly to the right of the current position
  - `:?\s+@?` Match an optional `:`, 1+ whitespace chars and an optional `@`
- `)` Close the non-capture group
- `([\w.-]+)` Capture group 1, match 1+ of the listed characters

```python
import re

# Same idea as the pattern above, with the optional ":" placed before the lookahead.
pattern = re.compile(r"\btwitter(?:\.com/|:?(?!\s*(?:https?://|twitter\b))\s+@?)([\w.-]+)", re.IGNORECASE)


def get_username(string):
    m = pattern.search(string)
    if m:
        return m.group(1)
    return None


print(get_username("Twitter: https://twitter.com/foo123"))
print(get_username("Twitter: twitter.com/foo123"))
print(get_username("https://twitter.com/foo123"))
print(get_username("https://twitter.com/foo123?blah"))
print(get_username("Twitter foo123"))
print(get_username("Twitter @foo123"))
print(get_username("Twitter: foo123"))
print(get_username("Twitter: foo123 | youtube: ..."))
```

Output

```
foo123
foo123
foo123
foo123
foo123
foo123
foo123
foo123
```
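Since this answer is framed around strings that may contain multiple matches, a possible follow-up is to collect every username rather than only the first. The sketch below assumes that is the goal; `get_usernames` and the sample string are hypothetical.

```python
import re

pattern = re.compile(
    r"\btwitter(?:\.com/|:?(?!\s*(?:https?://|twitter\b))\s+@?)([\w.-]+)",
    re.IGNORECASE,
)


def get_usernames(string):
    """Return every captured username, in order of appearance."""
    # With a single capture group, findall returns the group values directly.
    return pattern.findall(string)


text = "Twitter: @alice | twitter.com/bob | Twitter: https://twitter.com/carol"
print(get_usernames(text))  # ['alice', 'bob', 'carol']
```

The third label, `Twitter: https://...`, is skipped by the lookahead, so `carol` is picked up from the URL that follows it.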