Ignore certain beginnings of words in a regular expression
Question
I'm trying to parse all the links in a message.
My Java code looks like this:
Pattern URLPATTERN = Pattern.compile(
"([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&](?:\\w+)=(?:\\w+))+|[--:\\w?@%&+~#=]+)?",
Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
Matcher matcher = Patterns.URLPATTERN.matcher(message);
ArrayList<int[]> links = new ArrayList<>();
while (matcher.find())
links.add(new int[] {matcher.start(1), matcher.end()});
[...]
The problem is that the links are sometimes preceded by a colour code of the form [&§]{1}[a-z0-9]{1}
For example: Please use Google: §ehttps://google.com, and don't ask me.
The regular expression, which I found somewhere on the internet, matches ehttps://google.com
but it should match only https://google.com
How can I change the regular expression above so that it skips a leading colour code but still matches the link that follows right after it?
Answer 1
Score: 2
You can add a (?:[&§][a-z0-9])? pattern (matching an optional sequence of a & or § followed by an ASCII letter or digit) at the beginning of your regex:
Pattern URLPATTERN = Pattern.compile(
"(?:[&§][a-z0-9])?([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&]\\w+=\\w+)+|[--:\\w?@%&+~#=]+)?", Pattern.CASE_INSENSITIVE);
When the regex finds §ehttps://google.com, the §e is matched by the optional non-capturing group (?:[&§][a-z0-9])?, which is why it is "excluded" from the Group 1 value.
There is no need to use Pattern.MULTILINE | Pattern.DOTALL with your regex, because the pattern contains no unescaped . and no ^ / $ anchors.
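To see the effect end to end, here is a minimal, self-contained sketch that plugs the adjusted pattern into the loop from the question. The class name ColorCodeLinkDemo and the method name findLinks are made up for illustration, and § is written as the escape \u00A7 only to keep the source ASCII-safe; neither is part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is an assumption, not from the question.
class ColorCodeLinkDemo {
    // \u00A7 is the section sign used by the colour codes.
    static final Pattern URL_PATTERN = Pattern.compile(
        "(?:[&\u00A7][a-z0-9])?"                      // optional colour code: consumed, not captured
        + "([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})"    // group 1: everything up to and including the TLD
        + "((?:[?&]\\w+=\\w+)+|[--:\\w?@%&+~#=]+)?",  // group 2: optional query string or path
        Pattern.CASE_INSENSITIVE);

    // Returns {start, end} offsets per link, in the same shape as the question's list.
    static List<int[]> findLinks(String message) {
        List<int[]> links = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(message);
        while (matcher.find()) {
            // start(1) skips the colour code because it sits outside group 1.
            links.add(new int[] {matcher.start(1), matcher.end()});
        }
        return links;
    }

    public static void main(String[] args) {
        String message = "Please use Google: \u00A7ehttps://google.com, and don't ask me.";
        for (int[] link : findLinks(message)) {
            System.out.println(message.substring(link[0], link[1])); // prints https://google.com
        }
    }
}
```

Note that matcher.start() (without a group index) would still point at the colour code, since the whole match includes it; the offsets stay correct only because the loop uses matcher.start(1).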