Can these pairs of regexes be simplified into one?

Question

I'm trying to extract Twitter usernames from strings. My current solution looks like this:

```python
import re

def get_username(string):
    p1 = re.compile(r'twitter\.com/([a-z0-9_\.\-]+)', re.IGNORECASE)
    p2 = re.compile(r'twitter[\s\:@]+([a-z0-9_\.\-]+)', re.IGNORECASE)
    match1 = re.search(p1, string)
    match2 = re.search(p2, string)
    # Prefer the URL form; fall back to the "Twitter: name" label form.
    if match1:
        return match1.group(1)
    elif match2:
        return match2.group(1)
    else:
        return None
```

Examples:

```python
get_username("Twitter: https://twitter.com/foo123")
get_username("Twitter: twitter.com/foo123")
get_username("https://twitter.com/foo123")
get_username("https://twitter.com/foo123?blah")
get_username("Twitter foo123")
get_username("Twitter @foo123")
get_username("Twitter: foo123")
get_username("Twitter: foo123 | youtube: ...")
```

I'm wondering whether my two regexes can be simplified into one. My best attempt was:

```python
pattern = re.compile(r'twitter(?:(?:\.com/)|(?:[\s\:@]+))([a-z0-9_\.\-]+)', re.IGNORECASE)
```

but this fails on the first example because `Twitter: https` matches *before* `twitter.com/foo123`.
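To make the failure concrete, here is a minimal sketch that runs the single-pattern attempt against two of the examples. The names `combined` and `get_username_combined` are made up for this illustration; the pattern is the one from the question.

```python
import re

# The single-pattern attempt from the question.
combined = re.compile(
    r'twitter(?:(?:\.com/)|(?:[\s\:@]+))([a-z0-9_\.\-]+)', re.IGNORECASE
)

def get_username_combined(string):
    # re.search returns the leftmost match, so an early "Twitter: " label
    # wins over a "twitter.com/" URL that appears later in the string.
    match = combined.search(string)
    return match.group(1) if match else None

print(get_username_combined("Twitter: https://twitter.com/foo123"))  # https
print(get_username_combined("https://twitter.com/foo123"))           # foo123
```

The two-pattern version avoids this only because it tries the `twitter.com/` pattern first, wherever it occurs in the string.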

Answer 1

Score: 3

Add the greedy quantifier `.*` to the front of the regex pattern, giving `'.*twitter(?:(?:\.com/)|(?::?\s+@?))([a-z0-9_\.\-]+)'`, to skip any earlier (optional) `twitter` keywords and capture the last one:

```python
import re

def get_username(string):
    pat = re.compile(r'.*twitter(?:(?:\.com/)|(?::?\s+@?))([a-z0-9_\.\-]+)', re.IGNORECASE)
    if (match := pat.search(string)):
        print(match.group(1))
        return match.group(1)
    return None

get_username("Twitter: https://twitter.com/foo123")
get_username("Twitter: twitter.com/foo123")
get_username("https://twitter.com/foo123")
get_username("https://twitter.com/foo123?blah")
get_username("Twitter foo123")
get_username("Twitter @foo123")
get_username("Twitter: foo123")
get_username("Twitter: foo123 | youtube: ...")
get_username("Twitt11er: foo123 | youtube: ...")  # no match, prints nothing
```

Result:

    foo123
    foo123
    foo123
    foo123
    foo123
    foo123
    foo123
    foo123
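The effect of the greedy prefix can be seen by compiling the same pattern with and without it. This is only a sketch; `tail`, `without_prefix` and `with_prefix` are names invented here, not part of the answer.

```python
import re

# The pattern from the answer, split so the ".*" prefix can be toggled.
tail = r'twitter(?:(?:\.com/)|(?::?\s+@?))([a-z0-9_\.\-]+)'
without_prefix = re.compile(tail, re.IGNORECASE)
with_prefix = re.compile(r'.*' + tail, re.IGNORECASE)

text = "Twitter: https://twitter.com/foo123"

# Without the prefix, the leftmost "Twitter" is the one that matches.
print(without_prefix.search(text).group(1))  # https

# With the prefix, ".*" first consumes as much as it can and then backtracks
# only as far as needed, so the last "twitter" in the string is used.
print(with_prefix.search(text).group(1))     # foo123
```

One thing to keep in mind: `.` does not match a newline unless `re.DOTALL` is set, so on multi-line input the prefix only skips ahead within a single line.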

Answer 2

Score: 1

If the string always ends with the username, just use `(\w+)$`:

```python
import re

def get_username(string):
    if match1 := re.search(r'(\w+)$', string):
        return match1.group(1)
    return None
```
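The condition "always ends with the username" matters for the examples in the question. The sketch below applies the same pattern to three of them; `get_username_tail` is a name chosen here to avoid clashing with the other versions.

```python
import re

def get_username_tail(string):
    # Same pattern as the answer: one or more word characters anchored
    # at the end of the string.
    if match := re.search(r'(\w+)$', string):
        return match.group(1)
    return None

print(get_username_tail("Twitter: foo123"))                  # foo123
print(get_username_tail("https://twitter.com/foo123?blah"))  # blah
print(get_username_tail("Twitter: foo123 | youtube: ..."))   # None
```

So this approach covers the inputs where nothing follows the username, but not the ones with a query string or trailing text.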

Answer 3

Score: 1

I'd try a negative lookahead, `(?!https?://)`, to reject any captured "username" that appears to start with `http://` or `https://`.

    twitter(?:(?:\.com/)|(?:[\s\:@]+))(?!https?://)([a-z0-9_\.\-]+)
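As a quick check, the sketch below runs this lookahead pattern over the eight examples from the question. The names `lookahead`, `get_username_lookahead` and `examples` are introduced here for illustration. By my reading, the lookahead fixes the first example, but the second one (`Twitter: twitter.com/foo123`) still captures `twitter.com`, because the text after the label does not start with a URL scheme.

```python
import re

# The pattern from this answer, compiled case-insensitively as in the question.
lookahead = re.compile(
    r'twitter(?:(?:\.com/)|(?:[\s\:@]+))(?!https?://)([a-z0-9_\.\-]+)',
    re.IGNORECASE,
)

def get_username_lookahead(string):
    match = lookahead.search(string)
    return match.group(1) if match else None

examples = [
    "Twitter: https://twitter.com/foo123",
    "Twitter: twitter.com/foo123",
    "https://twitter.com/foo123",
    "https://twitter.com/foo123?blah",
    "Twitter foo123",
    "Twitter @foo123",
    "Twitter: foo123",
    "Twitter: foo123 | youtube: ...",
]

for s in examples:
    print(f"{s!r} -> {get_username_lookahead(s)!r}")
```

Handling the second example as well would need the lookahead to also exclude a following `twitter`, or a different strategy for picking the match.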

Answer 4

Score: 1

If there can be multiple matches, you can use a negative lookahead to rule out `twitter`, `http://` or `https://` to the right, and then take the value of capture group 1.

    \btwitter(?:\.com/|(?!:?\s*(?:https?://|twitter\b)):?\s+@?)([\w.-]+)

Explanation

  • `\btwitter` Match the word twitter, starting at a word boundary
  • `(?:` Non-capturing group for the alternatives
    • `\.com/` Match `.com/`
    • `|` Or
    • `(?!:?\s*(?:https?://|twitter\b))` Negative lookahead: assert that what is directly to the right of the current position is not an optional `:` and optional whitespace chars followed by `http://`, `https://` or the word twitter
    • `:?\s+@?` Match an optional `:`, 1+ whitespace chars and an optional `@`
  • `)` Close the non-capturing group
  • `([\w.-]+)` Capture group 1, match 1+ of the listed characters

Note that the Python code for this answer writes the optional `:` before the lookahead rather than inside it; on the examples from the question both forms give the same result.
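The breakdown above can also be kept next to the pattern itself by using `re.VERBOSE`. This is a sketch of the same regex (lookahead-first variant) with inline comments; the name `verbose_pattern` is an assumption made here, not something from the answer.

```python
import re

# Same pattern as above, spread over several lines. In VERBOSE mode,
# whitespace outside character classes and "#" comments are ignored.
verbose_pattern = re.compile(
    r"""
    \btwitter                              # the word twitter at a word boundary
    (?:                                    # alternatives:
        \.com/                             #   the URL form ".com/"
      |
        (?!:?\s*(?:https?://|twitter\b))   #   not followed by a URL scheme or another twitter
        :?\s+@?                            #   optional ":", whitespace, optional "@"
    )
    ([\w.-]+)                              # group 1: the username
    """,
    re.IGNORECASE | re.VERBOSE,
)

m = verbose_pattern.search("Twitter: https://twitter.com/foo123")
print(m.group(1) if m else None)  # foo123
```

The verbose form is longer, but it keeps the explanation from drifting away from the regex when the pattern is edited later.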

```python
import re

pattern = re.compile(r"\btwitter(?:\.com/|:?(?!\s*(?:https?://|twitter\b))\s+@?)([\w.-]+)", re.IGNORECASE)

def get_username(string):
    m = pattern.search(string)
    if m:
        return m.group(1)
    return None

print(get_username("Twitter: https://twitter.com/foo123"))
print(get_username("Twitter: twitter.com/foo123"))
print(get_username("https://twitter.com/foo123"))
print(get_username("https://twitter.com/foo123?blah"))
print(get_username("Twitter foo123"))
print(get_username("Twitter @foo123"))
print(get_username("Twitter: foo123"))
print(get_username("Twitter: foo123 | youtube: ..."))
```

Output

    foo123
    foo123
    foo123
    foo123
    foo123
    foo123
    foo123
    foo123
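Since this answer is framed around strings that can contain more than one match, here is a small sketch that collects every username instead of only the first. It reuses the compiled pattern from the code above; the function name `get_usernames` and the two-handle input string are made up for the illustration.

```python
import re

pattern = re.compile(
    r"\btwitter(?:\.com/|:?(?!\s*(?:https?://|twitter\b))\s+@?)([\w.-]+)",
    re.IGNORECASE,
)

def get_usernames(string):
    # With exactly one capture group, findall returns a list of the
    # group-1 strings, one per non-overlapping match, left to right.
    return pattern.findall(string)

print(get_usernames("Twitter: foo123 | twitter.com/bar456"))  # ['foo123', 'bar456']
print(get_usernames("Twitter: https://twitter.com/foo123"))   # ['foo123']
```

In the second call the leading `Twitter:` label is skipped by the lookahead, so the handle is reported once rather than twice.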

huangapple
  • Posted on March 4, 2023 at 02:12:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75630544.html