Posted 2023-03-04 02:12:54

Can these pairs of regexes be simplified into one?

# Question

I'm trying to extract Twitter usernames from strings. My current solution looks like this:

```python
import re

def get_username(string):
    # Form 1: a twitter.com/<name> URL. Form 2: "twitter" followed by a separator.
    p1 = re.compile(r'twitter\.com/([a-z0-9_\.\-]+)', re.IGNORECASE)
    p2 = re.compile(r'twitter[\s\:@]+([a-z0-9_\.\-]+)', re.IGNORECASE)
    match1 = re.search(p1, string)
    match2 = re.search(p2, string)
    # Prefer the URL form when both patterns match.
    if match1:
        return match1.group(1)
    elif match2:
        return match2.group(1)
    else:
        return None
```

Examples:

```python
get_username("Twitter: https://twitter.com/foo123")
get_username("Twitter: twitter.com/foo123")
get_username("https://twitter.com/foo123")
get_username("https://twitter.com/foo123?blah")
get_username("Twitter foo123")
get_username("Twitter @foo123")
get_username("Twitter: foo123")
get_username("Twitter: foo123 | youtube: ...")
```

I'm wondering whether my two regexes can be simplified into one. My best attempt was:

```python
pattern = re.compile(r'twitter(?:(?:\.com/)|(?:[\s\:@]+))([a-z0-9_\.\-]+)', re.IGNORECASE)
```

But this fails on the first example, because `Twitter: https` matches *before* `twitter.com/foo123`.
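
The failure can be reproduced with a short check. This is only a minimal sketch: the helper name `first_capture` is hypothetical and not part of the question. It shows that `search` returns the leftmost match, so the leading `Twitter: ` wins over the later URL.

```python
import re

# The combined pattern from the question's attempt.
combined = re.compile(
    r'twitter(?:(?:\.com/)|(?:[\s\:@]+))([a-z0-9_\.\-]+)', re.IGNORECASE
)

def first_capture(text):
    """Return group 1 of the leftmost match, or None (hypothetical helper)."""
    m = combined.search(text)
    return m.group(1) if m else None

print(first_capture("Twitter: https://twitter.com/foo123"))  # https
print(first_capture("https://twitter.com/foo123"))           # foo123
```

The capture stops at `https` because `:` is not in the username character class.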


# Answer 1

**Score**: 3

Prefix the pattern with the greedy quantifier `.*`, giving `'.*twitter(?:(?:\.com/)|(?::?\s+@?))([a-z0-9_\.\-]+)'`, so that the match skips any earlier (optional) `twitter` keyword and captures the last one:

```python
import re

def get_username(string):
    # The leading greedy .* pushes the match to the last "twitter" in the string.
    pat = re.compile(r'.*twitter(?:(?:\.com/)|(?::?\s+@?))([a-z0-9_\.\-]+)', re.IGNORECASE)
    if (match := pat.search(string)):
        print(match.group(1))
        return match.group(1)
    return None

get_username("Twitter: https://twitter.com/foo123")
get_username("Twitter: twitter.com/foo123")
get_username("https://twitter.com/foo123")
get_username("https://twitter.com/foo123?blah")
get_username("Twitter foo123")
get_username("Twitter @foo123")
get_username("Twitter: foo123")
get_username("Twitter: foo123 | youtube: ...")
get_username("Twitt11er: foo123 | youtube: ...")  # no match: prints nothing, returns None
```

Result:

```
foo123
foo123
foo123
foo123
foo123
foo123
foo123
foo123
```
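
To see what the leading `.*` changes, the same pattern can be compiled with and without it and run on the first example. This is a small sketch; the names `TAIL`, `without_prefix`, and `with_prefix` are made up for illustration.

```python
import re

# Answer 1's pattern without its leading ".*".
TAIL = r'twitter(?:(?:\.com/)|(?::?\s+@?))([a-z0-9_\.\-]+)'

without_prefix = re.compile(TAIL, re.IGNORECASE)
with_prefix = re.compile(r'.*' + TAIL, re.IGNORECASE)

text = "Twitter: https://twitter.com/foo123"

# Without the prefix, the leftmost "Twitter" is matched first.
print(without_prefix.search(text).group(1))  # https

# With the prefix, the greedy .* consumes as much as it can and backtracks,
# so the last "twitter" that still allows a match is used.
print(with_prefix.search(text).group(1))     # foo123
```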


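# Answer 2

**Score**: 1

If the string always ends with the username, just use `(\w+)$`:

```python
import re

def get_username(string):
    # Capture the run of word characters at the very end of the string.
    if match1 := re.search(r'(\w+)$', string):
        return match1.group(1)
    return None
```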
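
The condition "always ends with the username" matters here. As a quick sketch, running the same function on three of the question's examples shows one input that satisfies the condition and two that do not:

```python
import re

def get_username(string):
    # Capture the run of word characters at the very end of the string.
    if match := re.search(r'(\w+)$', string):
        return match.group(1)
    return None

print(get_username("Twitter @foo123"))                  # foo123
print(get_username("https://twitter.com/foo123?blah"))  # blah
print(get_username("Twitter: foo123 | youtube: ..."))   # None
```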


# Answer 3

**Score**: 1

I'd try a negative lookahead, `(?!https?://)`, to exclude any username that appears to start with `http://` or `https://`.

```
twitter(?:(?:\.com/)|(?:[\s\:@]+))(?!https?://)([a-z0-9_\.\-]+)
```
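
A quick check of this pattern against the question's first two examples, as a sketch (the helper name `capture` is hypothetical). The lookahead handles the `https://` case, but the second example has no scheme, so the leading `Twitter: ` still matches first and the capture is `twitter.com`:

```python
import re

pattern = re.compile(
    r'twitter(?:(?:\.com/)|(?:[\s\:@]+))(?!https?://)([a-z0-9_\.\-]+)',
    re.IGNORECASE,
)

def capture(text):
    """Return group 1 of the leftmost match, or None (hypothetical helper)."""
    m = pattern.search(text)
    return m.group(1) if m else None

print(capture("Twitter: https://twitter.com/foo123"))  # foo123
print(capture("Twitter: twitter.com/foo123"))          # twitter.com
```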


# Answer 4

**Score**: 1

If there can be multiple matches, you can use a negative lookahead to rule out `twitter`, `http://`, or `https://` to the right, and then read the value of capture group 1.

```
\btwitter(?:\.com/|(?!:?\s*(?:https?://|twitter\b)):?\s+@?)([\w.-]+)
```

Explanation

- `\btwitter` Match the word `twitter`
- `(?:` Non-capturing group for the alternatives
  - `\.com/` Match `.com/`
  - `|` Or
  - `(?!:?\s*(?:https?://|twitter\b))` Negative lookahead: assert that what is directly to the right of the current position is not `http://`, `https://`, or the word `twitter`, preceded by an optional `:` and optional whitespace characters
  - `:?\s+@?` Match an optional `:`, one or more whitespace characters, and an optional `@`
- `)` Close the non-capturing group
- `([\w.-]+)` Capture group 1, match one or more of the listed characters

```python
import re

# Same idea as the pattern above, with the optional ":" matched before the lookahead.
pattern = re.compile(r"\btwitter(?:\.com/|:?(?!\s*(?:https?://|twitter\b))\s+@?)([\w.-]+)", re.IGNORECASE)

def get_username(string):
    m = pattern.search(string)
    if m:
        return m.group(1)
    return None

print(get_username("Twitter: https://twitter.com/foo123"))
print(get_username("Twitter: twitter.com/foo123"))
print(get_username("https://twitter.com/foo123"))
print(get_username("https://twitter.com/foo123?blah"))
print(get_username("Twitter foo123"))
print(get_username("Twitter @foo123"))
print(get_username("Twitter: foo123"))
print(get_username("Twitter: foo123 | youtube: ..."))
```

Output

```
foo123
foo123
foo123
foo123
foo123
foo123
foo123
foo123
```
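
For readability, the explained pattern can also be written with `re.VERBOSE`, which lets the explanation sit next to each part of the regex. This is one possible layout, not the answer's original code; the names `VERBOSE_PATTERN`, `EXAMPLES`, and `get_username_verbose` are assumptions made for this sketch.

```python
import re

# Verbose form of the explained pattern: whitespace and "#" comments inside
# the pattern are ignored by the regex engine.
VERBOSE_PATTERN = re.compile(
    r"""
    \btwitter                 # the word twitter
    (?:
        \.com/                # URL form: twitter.com/<name>
      |
        (?!:?\s*(?:https?://|twitter\b))   # no URL or second twitter follows
        :?\s+@?               # optional colon, whitespace, optional @
    )
    ([\w.-]+)                 # group 1: the username
    """,
    re.IGNORECASE | re.VERBOSE,
)

def get_username_verbose(string):
    """Return the captured username, or None if nothing matches."""
    m = VERBOSE_PATTERN.search(string)
    return m.group(1) if m else None

EXAMPLES = [
    "Twitter: https://twitter.com/foo123",
    "Twitter: twitter.com/foo123",
    "https://twitter.com/foo123",
    "https://twitter.com/foo123?blah",
    "Twitter foo123",
    "Twitter @foo123",
    "Twitter: foo123",
    "Twitter: foo123 | youtube: ...",
]

for example in EXAMPLES:
    print(get_username_verbose(example))  # foo123 for each example
```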
