How can Java's Matcher.group(int) avoid matching the contents of nested parentheses?

Question

I have a string like

String str = "美国临时申请No.62004615";

And a regex like

String regex = "(((美国|PCT|加拿大){0,1})([\\u4E00-\\u9FA5]{1,8})((NO.|NOS.){1})([\\d]{5,}))";

And other code is

    Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(str);
    while (matcher.find()) {
        System.out.println("1:" + matcher.group(1) + "\n"
                + "2:" + matcher.group(2) + "\n"
                + "3:" + matcher.group(3) + "\n"
                + "4:" + matcher.group(4) + "\n"
                + "5:" + matcher.group(5) + "\n"
                + "6:" + matcher.group(6) + "\n"
                + "7:" + matcher.group(7));
    }

I know that parentheses () are used to group parts of a regex, and that group 1 is the outermost group.
I expected the second group to be ((美国|PCT|加拿大){0,1}), matching "美国", "PCT", or "加拿大".
I expected the third group to be ([\u4E00-\u9FA5]{1,8}), matching one to eight Chinese characters.
I expected the fourth group to be ((NO.|NOS.){1}), matching NO. or NOS.
I expected the fifth group to be ([\d]{5,}), matching the number.
But the console output is

1:美国临时申请No.62004615
2:美国
3:美国
4:临时申请
5:No.
6:No.
7:62004615

Group 2 is the same as group 3, and group 5 is the same as group 6.
It seems that the nested parentheses inside each group are captured again as groups of their own. Is there a way to capture only the outermost parentheses?
The ideal result would be

1:美国临时申请No.62004615
2:美国
3:临时申请
4:No.
5:62004615

Answer 1

Score: 2

It sounds like you want a non-capturing group. From the Pattern documentation:

> (?:X)    X, as a non-capturing group

So, change this:

(美国|PCT|加拿大)

to this:

(?:美国|PCT|加拿大)

… and then it will not be represented as a group at all in the Matcher.
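Capturing groups are numbered by the order of their opening parentheses, which is why the inner alternations appeared as groups 3 and 6 in the question. The sketch below applies the change to the full pattern. Making `(NO.|NOS.)` non-capturing as well is an assumption that extends the answer to the second duplicate, and the class name `NonCapturingDemo` and helper `groups` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonCapturingDemo {
    // Same pattern as in the question, except that the two inner alternations
    // use (?:...) and therefore no longer receive group numbers of their own.
    static final String REGEX =
        "(((?:美国|PCT|加拿大){0,1})([\\u4E00-\\u9FA5]{1,8})((?:NO.|NOS.){1})([\\d]{5,}))";

    // Returns groups 1..groupCount() of the first match, or an empty list if nothing matches.
    static List<String> groups(String input) {
        Matcher matcher = Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE).matcher(input);
        List<String> result = new ArrayList<>();
        if (matcher.find()) {
            for (int i = 1; i <= matcher.groupCount(); i++) {
                result.add(matcher.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> found = groups("美国临时申请No.62004615");
        for (int i = 0; i < found.size(); i++) {
            System.out.println((i + 1) + ":" + found.get(i));
        }
    }
}
```

With this pattern `groupCount()` is 5 rather than 7, so the group numbers line up with the ideal result listed in the question.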

Some side notes:

  • {0,1} is the same as writing ?.
  • {1} does nothing and can be removed entirely.
  • [\\d] is the same as just \\d.
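
The side notes can be applied to the whole pattern roughly as follows. This is only a sketch: the class name `SimplifiedPatternDemo` and the helper `groups` are hypothetical, and both patterns assume the inner alternations have already been made non-capturing as the answer suggests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimplifiedPatternDemo {
    // The question's pattern with the inner alternations made non-capturing.
    static final String VERBOSE =
        "(((?:美国|PCT|加拿大){0,1})([\\u4E00-\\u9FA5]{1,8})((?:NO.|NOS.){1})([\\d]{5,}))";

    // After the side notes: {0,1} becomes ?, {1} is dropped, [\\d] becomes \\d.
    // With {1} gone, (NO.|NOS.) can act as the capturing group directly.
    static final String SIMPLIFIED =
        "(((?:美国|PCT|加拿大)?)([\\u4E00-\\u9FA5]{1,8})(NO.|NOS.)(\\d{5,}))";

    // Returns groups 1..groupCount() of the first match, or an empty list if nothing matches.
    static List<String> groups(String regex, String input) {
        Matcher matcher = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        List<String> result = new ArrayList<>();
        if (matcher.find()) {
            for (int i = 1; i <= matcher.groupCount(); i++) {
                result.add(matcher.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "美国临时申请No.62004615";
        // Compares the captured groups of both patterns for the sample string.
        System.out.println(groups(VERBOSE, str).equals(groups(SIMPLIFIED, str)));
    }
}
```

Both patterns capture the same five groups for the sample string. Separately from the side notes, the unescaped `.` in `NO.` matches any character, so `NO\\.` may be closer to the intent if a literal period is required. That is an assumption about what the asker wants.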

huangapple
  • Posted on March 16, 2020 at 19:15:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/60704939.html