Ignore certain beginnings of words in a regular expression
Question
I'm trying to parse all the links in a message.
My Java code looks like this:
Pattern URLPATTERN = Pattern.compile(
"([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&](?:\\w+)=(?:\\w+))+|[--:\\w?@%&+~#=]+)?",
Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
Matcher matcher = Patterns.URLPATTERN.matcher(message);
ArrayList<int[]> links = new ArrayList<>();
while (matcher.find())
links.add(new int[] {matcher.start(1), matcher.end()});
[...]
The problem is that links are sometimes preceded by a colour code of the following form: [&§]{1}[a-z0-9]{1}
An example: Please use Google: §ehttps://google.com, and don't ask me.
The regular expression, which I found somewhere on the internet, matches ehttps://google.com, but it should match only https://google.com.
How can I change the regular expression above so that it skips the colour code but still matches the link that follows it?
Answer 1
Score: 2
You can add an optional (?:[&§][a-z0-9])? group at the beginning of your regex. It matches an optional sequence of & or § followed by an ASCII letter or digit:
Pattern URLPATTERN = Pattern.compile(
"(?:[&§][a-z0-9])?([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&]\\w+=\\w+)+|[--:\\w?@%&+~#=]+)?", Pattern.CASE_INSENSITIVE);
See the regex demo.
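When the regex reaches §ehttps://google.com, the §e is consumed by the optional non-capturing group (?:[&§][a-z0-9])?, which is why it is excluded from the Group 1 value. The overall match still starts at §, so keep using matcher.start(1), as your code already does, to get the start of the link itself.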
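To make the effect on Group 1 concrete, here is a minimal, self-contained sketch that runs the pattern above on the example message from the question. The class name ColorCodeLinkDemo and the helper findLinks are hypothetical names chosen for illustration; the § character is written as \u00A7 only so that the source compiles regardless of file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the asker's code base.
class ColorCodeLinkDemo {
    // Same pattern as in the answer, with the section sign escaped as \u00A7.
    static final Pattern URLPATTERN = Pattern.compile(
        "(?:[&\u00A7][a-z0-9])?([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&]\\w+=\\w+)+|[--:\\w?@%&+~#=]+)?",
        Pattern.CASE_INSENSITIVE);

    // Returns {start, end} offsets of every link, skipping a leading colour code.
    static List<int[]> findLinks(String message) {
        List<int[]> links = new ArrayList<>();
        Matcher matcher = URLPATTERN.matcher(message);
        while (matcher.find()) {
            // start(1) is where Group 1 begins, i.e. after the optional colour code.
            links.add(new int[] {matcher.start(1), matcher.end()});
        }
        return links;
    }

    public static void main(String[] args) {
        String message = "Please use Google: \u00A7ehttps://google.com, and don't ask me.";
        for (int[] span : findLinks(message)) {
            // Prints: https://google.com
            System.out.println(message.substring(span[0], span[1]));
        }
    }
}
```

Using matcher.start() instead of matcher.start(1) here would include the colour code in the span, because the whole match begins at the § character.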
There is no need for Pattern.MULTILINE | Pattern.DOTALL with your regex, since the pattern contains no unescaped . and no ^/$ anchors.
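As a quick sanity check of the point about the flags, the sketch below compiles the same regex once with only CASE_INSENSITIVE and once with all three flags, then compares the Group 1 values found in a multi-line sample. The class name FlagComparisonDemo, its helper methods, and the sample text are assumptions made for illustration. A test on one input is not a proof; the actual reason is that MULTILINE only changes ^ and $, and DOTALL only changes the unescaped dot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the asker's code base.
class FlagComparisonDemo {
    // Same regex as in the answer, with the section sign escaped as \u00A7.
    static final String REGEX =
        "(?:[&\u00A7][a-z0-9])?([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&]\\w+=\\w+)+|[--:\\w?@%&+~#=]+)?";

    // Collects the Group 1 value of every match using the given flags.
    static List<String> groupOneMatches(String message, int flags) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(REGEX, flags).matcher(message);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    // True when the extra flags do not change what is matched in this message.
    static boolean flagsMakeNoDifference(String message) {
        int minimal = Pattern.CASE_INSENSITIVE;
        int original = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL;
        return groupOneMatches(message, minimal).equals(groupOneMatches(message, original));
    }

    public static void main(String[] args) {
        String message = "first line\n\u00A7ehttps://google.com\nsecond line http://example.org/path";
        // Prints: true
        System.out.println(flagsMakeNoDifference(message));
    }
}
```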