Can these pairs of regexes be simplified into one?

# Question

I'm trying to fetch Twitter usernames from strings. My current solution looks like this:

```python
import re

def get_username(string):
    # Pattern 1: a twitter.com URL followed by the username
    p1 = re.compile(r'twitter\.com/([a-z0-9_\.\-]+)', re.IGNORECASE)
    # Pattern 2: the word "twitter", then whitespace, ":" or "@", then the username
    p2 = re.compile(r'twitter[\s\:@]+([a-z0-9_\.\-]+)', re.IGNORECASE)
    match1 = re.search(p1, string)
    match2 = re.search(p2, string)
    if match1:
        return match1.group(1)
    elif match2:
        return match2.group(1)
    else:
        return None
```

Examples:

```python
get_username("Twitter: https://twitter.com/foo123")
get_username("Twitter: twitter.com/foo123")
get_username("https://twitter.com/foo123")
get_username("https://twitter.com/foo123?blah")
get_username("Twitter foo123")
get_username("Twitter @foo123")
get_username("Twitter: foo123")
get_username("Twitter: foo123 | youtube: ...")
```

I'm wondering whether my two regexes can be simplified into one. My best attempt was:

```python
pattern = re.compile(r'twitter(?:(?:\.com/)|(?:[\s\:@]+))([a-z0-9_\.\-]+)', re.IGNORECASE)
```

But this fails on the first example because `Twitter: https` matches *before* `twitter.com/foo123`.



# Answer 1
**Score**: 3

Add the greedy quantifier `.*` to the front of the pattern, giving `'.*twitter(?:(?:\.com/)|(?::?\s+@?))([a-z0-9_\.\-]+)'`, to skip any earlier (optional) `twitter` keywords and capture the last one:

```python
import re

def get_username(string):
    # The leading greedy .* pushes the match to the last "twitter" in the string
    pat = re.compile(r'.*twitter(?:(?:\.com/)|(?::?\s+@?))([a-z0-9_\.\-]+)', re.IGNORECASE)
    if (match := pat.search(string)):  # the walrus operator requires Python 3.8+
        print(match.group(1))
        return match.group(1)
    return None

get_username("Twitter: https://twitter.com/foo123")
get_username("Twitter: twitter.com/foo123")
get_username("https://twitter.com/foo123")
get_username("https://twitter.com/foo123?blah")
get_username("Twitter foo123")
get_username("Twitter @foo123")
get_username("Twitter: foo123")
get_username("Twitter: foo123 | youtube: ...")
get_username("Twitt11er: foo123 | youtube: ...")  # no match, so nothing is printed
```

Result:

```
foo123
foo123
foo123
foo123
foo123
foo123
foo123
foo123
```
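
To see what the leading `.*` changes, the following sketch runs the same pattern with and without it on the first example from the question. The names `BODY`, `without_prefix`, `with_prefix`, and `first_capture` are made up for illustration; only the standard `re` module is assumed.

```python
import re

# Hypothetical names for illustration; the regex body is the one from the answer above.
BODY = r'twitter(?:(?:\.com/)|(?::?\s+@?))([a-z0-9_\.\-]+)'
without_prefix = re.compile(BODY, re.IGNORECASE)
with_prefix = re.compile(r'.*' + BODY, re.IGNORECASE)

def first_capture(pattern, text):
    """Return capture group 1 of the first match found by search(), or None."""
    match = pattern.search(text)
    return match.group(1) if match else None

text = "Twitter: https://twitter.com/foo123"

# Without the prefix, the leftmost "Twitter" matches first and captures the URL scheme.
print(first_capture(without_prefix, text))  # https
# With the greedy prefix, .* consumes as much as it can, so the last "twitter" is used.
print(first_capture(with_prefix, text))     # foo123
```

Because `.` does not match a newline by default, this trick works within a single line; multi-line input would need separate handling.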

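# Answer 2
**Score**: 1

If the string always ends with the username, just use `(\w+)$`:

```python
import re

def get_username(string):
    # Capture the run of word characters at the very end of the string
    if match1 := re.search(r'(\w+)$', string):
        return match1.group(1)
    return None
```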
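
As a quick check of that condition, this sketch runs `(\w+)$` over the example strings from the question. The `examples` list and the helper name `get_username_tail` are introduced here for illustration only.

```python
import re

def get_username_tail(string):
    """Return the run of word characters at the very end of the string, or None."""
    if match := re.search(r'(\w+)$', string):
        return match.group(1)
    return None

# Example inputs copied from the question.
examples = [
    "Twitter: https://twitter.com/foo123",
    "Twitter: twitter.com/foo123",
    "https://twitter.com/foo123",
    "https://twitter.com/foo123?blah",
    "Twitter foo123",
    "Twitter @foo123",
    "Twitter: foo123",
    "Twitter: foo123 | youtube: ...",
]

for text in examples:
    print(f"{text!r} -> {get_username_tail(text)!r}")
```

In this sketch the two inputs that do not end with the username stand out: `"https://twitter.com/foo123?blah"` yields `blah`, and `"Twitter: foo123 | youtube: ..."` yields `None`.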

# Answer 3
**Score**: 1

I'd try a negative lookahead, `(?!https?://)`, to exclude any captured username that appears to start with `http://` or `https://`:

```
twitter(?:(?:\.com/)|(?:[\s\:@]+))(?!https?://)([a-z0-9_\.\-]+)
```
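
A minimal sketch of how this pattern could be plugged into the question's `get_username`. The answer gives only the pattern, so the wrapper, the `PATTERN` name, and the use of `re.IGNORECASE` (carried over from the question) are assumptions.

```python
import re

# Pattern from the answer above; IGNORECASE is assumed, as in the question.
PATTERN = re.compile(
    r'twitter(?:(?:\.com/)|(?:[\s\:@]+))(?!https?://)([a-z0-9_\.\-]+)',
    re.IGNORECASE,
)

def get_username(string):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = PATTERN.search(string)
    return match.group(1) if match else None

print(get_username("Twitter: https://twitter.com/foo123"))  # foo123
print(get_username("Twitter @foo123"))                      # foo123
print(get_username("Twitter: twitter.com/foo123"))          # twitter.com
```

The lookahead only rules out a scheme prefix, so in this sketch the question's second example, where the domain appears without `https://`, still returns `twitter.com`.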


# Answer 4
**Score**: 1

If there can be multiple matches, you can use a negative lookahead to rule out `twitter`, `http://`, or `https://` to the right, and then read the value of capture group 1.

```
\btwitter(?:\.com/|(?!:?\s*(?:https?://|twitter\b)):?\s+@?)([\w.-]+)
```

Explanation

  • `\btwitter` Match `twitter` preceded by a word boundary
  • `(?:` Non-capturing group for the alternatives
    • `\.com/` Match `.com/`
    • `|` Or
    • `(?!:?\s*(?:https?://|twitter\b))` Negative lookahead: assert that what is directly to the right of the current position is not an optional `:` and optional whitespace chars followed by `http://`, `https://`, or the word `twitter`
  • `:?\s+@?)` Match an optional `:`, 1+ whitespace chars, and an optional `@`, then close the group
  • `([\w.-]+)` Capture group 1, match 1+ of the listed characters

```python
import re

# Variant of the pattern above: the optional ":" is matched before the lookahead
pattern = re.compile(r"\btwitter(?:\.com/|:?(?!\s*(?:https?://|twitter\b))\s+@?)([\w.-]+)", re.IGNORECASE)

def get_username(string):
    m = pattern.search(string)
    if m:
        return m.group(1)
    return None

print(get_username("Twitter: https://twitter.com/foo123"))
print(get_username("Twitter: twitter.com/foo123"))
print(get_username("https://twitter.com/foo123"))
print(get_username("https://twitter.com/foo123?blah"))
print(get_username("Twitter foo123"))
print(get_username("Twitter @foo123"))
print(get_username("Twitter: foo123"))
print(get_username("Twitter: foo123 | youtube: ..."))
```

Output

```
foo123
foo123
foo123
foo123
foo123
foo123
foo123
foo123
```
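
The explanation above maps naturally onto `re.VERBOSE`, which lets a pattern carry its own comments. This is a sketch of the pattern as written in the explanation; the names `VERBOSE_PATTERN`, `get_username_verbose`, and `examples` are made up for illustration.

```python
import re

# Same pattern as in the explanation, spread over several lines with comments.
VERBOSE_PATTERN = re.compile(
    r"""
    \btwitter                        # "twitter" preceded by a word boundary
    (?:                              # non-capturing group for the alternatives
        \.com/                       # either a literal ".com/"
      |                              # or
        (?!                          # negative lookahead: what follows must not be
            :?\s*                    #   an optional ":" and optional whitespace
            (?:https?://|twitter\b)  #   then a URL scheme or the word "twitter"
        )
        :?\s+@?                      # optional ":", 1+ whitespace, optional "@"
    )
    ([\w.-]+)                        # capture group 1: the username characters
    """,
    re.IGNORECASE | re.VERBOSE,
)

def get_username_verbose(string):
    """Return capture group 1 of the first match, or None if nothing matches."""
    m = VERBOSE_PATTERN.search(string)
    return m.group(1) if m else None

# Example inputs copied from the question.
examples = [
    "Twitter: https://twitter.com/foo123",
    "Twitter: twitter.com/foo123",
    "https://twitter.com/foo123",
    "https://twitter.com/foo123?blah",
    "Twitter foo123",
    "Twitter @foo123",
    "Twitter: foo123",
    "Twitter: foo123 | youtube: ...",
]

for text in examples:
    print(get_username_verbose(text))
```

In verbose mode, whitespace and `#` comments inside the pattern are ignored, so the layout is free as long as no individual token is split.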

huangapple
  • Published on March 4, 2023, 02:12:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75630544.html