2020-10-12 00:13:53

Ignore creating beginnings of words in a regular expression

Question

I'm trying to parse all the links in a message.

My Java code looks like this:

Pattern URLPATTERN = Pattern.compile(
    "([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&](?:\\w+)=(?:\\w+))+|[--:\\w?@%&+~#=]+)?",
    Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
Matcher matcher = Patterns.URLPATTERN.matcher(message);
ArrayList<int[]> links = new ArrayList<>();
while (matcher.find())
    links.add(new int[] {matcher.start(1), matcher.end()});
[...]

The problem is that links are sometimes prefixed with a colour code that looks like this: [&§]{1}[a-z0-9]{1}

An example could be: Please use Google: §ehttps://google.com, and don't ask me.

The regex, which I found somewhere on the internet, matches ehttps://google.com, but it should only match https://google.com.

How can I change the regular expression above so that it excludes the following colour-code pattern but still matches the link that comes right after it?

[&§]{1}[a-z0-9]{1}

Answer 1

Score: 2

You can add a (?:[&§][a-z0-9])? pattern (matching an optional sequence of a & or § and then an ASCII letter or digit) at the beginning of your regex:

Pattern URLPATTERN = Pattern.compile(
    "(?:[&§][a-z0-9])?([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&]\\w+=\\w+)+|[--:\\w?@%&+~#=]+)?",
    Pattern.CASE_INSENSITIVE);

See the regex demo.

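When the regex finds §ehttps://google.com, the §e is consumed by the optional non-capturing group (?:[&§][a-z0-9])?, which is why it is "excluded" from the Group 1 value. Since your code takes the start of the link from matcher.start(1), the colour code is left out of the recorded range.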
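To see the effect of the optional group, here is a minimal sketch that runs the pattern against the example message from the question. The class name ColourCodeLinks and the helper extractLinks are made up for illustration, and § is written as the escape \u00A7 only so the source file does not depend on its encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColourCodeLinks {
    // Same pattern as above; \u00A7 is the section sign used in colour codes.
    // The leading non-capturing group consumes an optional colour code,
    // so it never ends up inside group 1.
    static final Pattern URL_PATTERN = Pattern.compile(
        "(?:[&\u00A7][a-z0-9])?([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&]\\w+=\\w+)+|[--:\\w?@%&+~#=]+)?",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns each link as text, measured from the
    // start of group 1 to the end of the whole match (as in the question).
    static List<String> extractLinks(String message) {
        List<String> links = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(message);
        while (matcher.find()) {
            links.add(message.substring(matcher.start(1), matcher.end()));
        }
        return links;
    }

    public static void main(String[] args) {
        String message = "Please use Google: \u00A7ehttps://google.com, and don't ask me.";
        System.out.println(extractLinks(message)); // prints [https://google.com]
    }
}
```

One caveat, which goes beyond what the question asks: & is also a member of the URL character class, so a &-style colour code glued directly to preceding URL-like text (for example Google:&ehttps://google.com) would still be swallowed into group 1, because the match then starts before the colour code is reached.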

There is no need to use Pattern.MULTILINE | Pattern.DOTALL with your regex, since the pattern contains no unescaped . and no ^ or $ anchors.
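As a quick sanity check of that claim, the sketch below compiles the same regex with and without the two extra flags and compares the results on a two-line message. The class name FlagComparison, the helper groupOneMatches, and the sample message are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagComparison {
    // Same regex as in the answer; \u00A7 is the section sign.
    static final String REGEX =
        "(?:[&\u00A7][a-z0-9])?([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&]\\w+=\\w+)+|[--:\\w?@%&+~#=]+)?";

    // Hypothetical helper: collects every link (group 1 start to match end)
    // found with the given set of flags.
    static List<String> groupOneMatches(String message, int flags) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(REGEX, flags).matcher(message);
        while (matcher.find()) {
            found.add(message.substring(matcher.start(1), matcher.end()));
        }
        return found;
    }

    public static void main(String[] args) {
        String message = "first line \u00A7ehttps://google.com\nsecond line https://example.org/a?b=c";
        List<String> minimal = groupOneMatches(message, Pattern.CASE_INSENSITIVE);
        List<String> allFlags = groupOneMatches(message,
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
        // MULTILINE only changes ^ and $, DOTALL only changes an unescaped dot,
        // and this regex uses neither, so both lists should be identical.
        System.out.println(minimal.equals(allFlags));
    }
}
```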
