How does Java's Matcher.group(int) avoid matching the contents of nested parentheses?
Question
I have a string like
String str = "美国临时申请No.62004615";
And a regex like
String regex = "(((美国|PCT|加拿大){0,1})([\\u4E00-\\u9FA5]{1,8})((NO.|NOS.){1})([\\d]{5,}))";
The rest of the code is
Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
    System.out.println("1:" + matcher.group(1) + "\n"
            + "2:" + matcher.group(2) + "\n"
            + "3:" + matcher.group(3) + "\n"
            + "4:" + matcher.group(4) + "\n"
            + "5:" + matcher.group(5) + "\n"
            + "6:" + matcher.group(6) + "\n"
            + "7:" + matcher.group(7));
}
I know that parentheses () are used to group parts of a regex, and that group 1 is the outermost group.
The second group is ((美国|PCT|加拿大){0,1}), which matches "美国", "PCT", or "加拿大".
The third group is ([\u4E00-\u9FA5]{1,8}), which matches one to eight Chinese characters.
The fourth group is ((NO.|NOS.){1}), which matches NO. or NOS.
The fifth group is ([\d]{5,}), which matches the number.
But the console output is
1:美国临时申请No.62004615
2:美国
3:美国
4:临时申请
5:No.
6:No.
7:62004615
Group 2 is the same as group 3, and group 5 is the same as group 6.
It seems that group 3 matches the nested parentheses inside group 2 a second time. Is there a way to capture only the outermost parentheses?
The ideal result would be
1:美国临时申请No.62004615
2:美国
3:临时申请
4:No.
5:62004615
Answer 1
Score: 2
It sounds like you want a non-capturing group. From the Pattern documentation:
> (?:X)    X, as a non-capturing group
So, change this:
(美国|PCT|加拿大)
to this:
(?:美国|PCT|加拿大)
… and then it will not be represented as a group at all in the Matcher.
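As a rough illustration, here is the question's pattern with both inner alternations made non-capturing. The class name NonCapturingDemo and the helper method groups are made up for this sketch; only Pattern and Matcher come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the helper below are not from the question.
public class NonCapturingDemo {
    // Same pattern as in the question, except that the two inner
    // alternations use (?:...) and therefore receive no group number.
    static final Pattern PATTERN = Pattern.compile(
            "(((?:美国|PCT|加拿大){0,1})([\\u4E00-\\u9FA5]{1,8})((?:NO.|NOS.){1})([\\d]{5,}))",
            Pattern.CASE_INSENSITIVE);

    // Returns groups 1..groupCount() of the first match, or an empty list if nothing matches.
    static List<String> groups(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = PATTERN.matcher(input);
        if (matcher.find()) {
            for (int i = 1; i <= matcher.groupCount(); i++) {
                result.add(matcher.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> found = groups("美国临时申请No.62004615");
        for (int i = 0; i < found.size(); i++) {
            System.out.println((i + 1) + ":" + found.get(i));
        }
    }
}
```

With the sample string from the question, this should print five lines (1:美国临时申请No.62004615, 2:美国, 3:临时申请, 4:No., 5:62004615), because capturing groups are numbered by their opening parentheses and (?: does not count.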
Some side notes:
- {0,1} is the same as writing ?.
- {1} does nothing and can be removed entirely.
- [\\d] is the same as just \\d.
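To check that these simplifications do not change what is captured, the sketch below compares the question's pattern (with non-capturing groups) against a shortened version. The class name SimplifiedPatternDemo, the constant names, and the helper are assumptions made for this example. Once {1} is dropped, the NO./NOS. alternation needs only a single capturing group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are chosen for illustration only.
public class SimplifiedPatternDemo {
    // The question's pattern with only the non-capturing change applied.
    static final String VERBOSE =
            "(((?:美国|PCT|加拿大){0,1})([\\u4E00-\\u9FA5]{1,8})((?:NO.|NOS.){1})([\\d]{5,}))";

    // The same pattern with {0,1} written as ?, {1} dropped, and the
    // character class around the digit shorthand removed.
    static final String SIMPLIFIED =
            "(((?:美国|PCT|加拿大)?)([\\u4E00-\\u9FA5]{1,8})(NO.|NOS.)(\\d{5,}))";

    // Returns groups 1..groupCount() of the first match, or an empty list if nothing matches.
    static List<String> groups(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        if (matcher.find()) {
            for (int i = 1; i <= matcher.groupCount(); i++) {
                result.add(matcher.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "美国临时申请No.62004615";
        System.out.println(groups(VERBOSE, str));
        System.out.println(groups(SIMPLIFIED, str));
    }
}
```

For the sample string, both calls should return the same five groups. The outer pair of parentheses is kept here only to preserve the numbering from the question; matcher.group(0) always returns the entire match, so that outer group could be dropped as well.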