Ignore creating beginnings of words in a regular expression

Question

I'm trying to parse all the links in a message.

My Java code looks like this:

Pattern URLPATTERN = Pattern.compile(
    "([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&](?:\\w+)=(?:\\w+))+|[--:\\w?@%&+~#=]+)?",
    Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

Matcher matcher = Patterns.URLPATTERN.matcher(message);
ArrayList<int[]> links = new ArrayList<>();
while (matcher.find())
    links.add(new int[] {matcher.start(1), matcher.end()});
[...]

The problem is that links are sometimes preceded by a colour code of the following form: [&§]{1}[a-z0-9]{1}

An example: Please use Google: §ehttps://google.com, and don't ask me.

The regex, which I found somewhere on the internet, matches ehttps://google.com here, but it should only match https://google.com.

How can I change the regular expression above so that it skips the following pattern but still matches the link that comes right after the colour code?

[&§]{1}[a-z0-9]{1}

Answer 1

Score: 2

You can add a (?:[&§][a-z0-9])? pattern at the beginning of your regex. It matches an optional sequence of a & or § followed by an ASCII letter or digit:

Pattern URLPATTERN = Pattern.compile(
    "(?:[&§][a-z0-9])?([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&]\\w+=\\w+)+|[--:\\w?@%&+~#=]+)?",
    Pattern.CASE_INSENSITIVE);

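When the regex finds §ehttps://google.com, the §e is consumed by the optional non-capturing group (?:[&§][a-z0-9])?, which is why it is "excluded" from the Group 1 value. The overall match still begins at the §, so keep reading the link position from matcher.start(1), as your code already does.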
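To see the effect end to end, here is a minimal sketch that runs the adjusted pattern against the example message. The class name ColorCodeLinks and the helper findLinks are made up for illustration, and the § character is written as \u00A7 only to sidestep source-encoding issues; the pattern itself is the one shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ColorCodeLinks {
    // Optional colour-code prefix (non-capturing), followed by the URL pattern from the question.
    static final Pattern URL_PATTERN = Pattern.compile(
        "(?:[&\u00A7][a-z0-9])?([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&]\\w+=\\w+)+|[--:\\w?@%&+~#=]+)?",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: collects the text from the start of group 1 to the end of each match,
    // mirroring the {matcher.start(1), matcher.end()} pairs stored in the question's code.
    static List<String> findLinks(String message) {
        List<String> links = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(message);
        while (matcher.find()) {
            links.add(message.substring(matcher.start(1), matcher.end()));
        }
        return links;
    }

    public static void main(String[] args) {
        String message = "Please use Google: \u00A7ehttps://google.com, and don't ask me.";
        System.out.println(findLinks(message)); // prints [https://google.com]
    }
}
```

Using matcher.start() instead of matcher.start(1) would bring the colour code back into the result, because the whole match (group 0) still includes the prefix.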

There is no need to use Pattern.MULTILINE | Pattern.DOTALL with your regex: the pattern contains no . wildcard (only an escaped literal dot) and no ^/$ anchors, so neither flag changes anything.
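As a quick sanity check of that claim, the following sketch compiles the same regex with and without the two extra flags and compares the Group 1 values on a two-line input. The class name FlagCheck, the helper groupOnes, and the sample input are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagCheck {
    static final String REGEX =
        "(?:[&\u00A7][a-z0-9])?([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&]\\w+=\\w+)+|[--:\\w?@%&+~#=]+)?";

    // Hypothetical helper: returns the Group 1 value of every match for the given flag set.
    static List<String> groupOnes(int flags, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(REGEX, flags).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Two lines, so MULTILINE or DOTALL would have a chance to matter if the pattern used ^, $ or .
        String input = "first: \u00A7ehttps://google.com\nsecond: example.org/a";
        List<String> minimal = groupOnes(Pattern.CASE_INSENSITIVE, input);
        List<String> allFlags = groupOnes(
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL, input);
        System.out.println(minimal.equals(allFlags)); // prints true
    }
}
```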

Published by huangapple on 2020-10-12 00:13:53
Permalink: https://go.coder-hub.com/64306272.html