Different behavior of regular expressions in Java with and without groups
Question
I am experimenting with regular expressions in Java, in particular with groups. I am trying to strip empty tags from a string containing XML. Without groups everything seems to work, but when I define the regex using groups I get behavior I don't understand. I expect the behavior shown in the last assertion in the code below:
@Test
public void testRegexpGroups() {
    String xml =
            "<root>\n" +
            " <yyy></yyy>\n" +
            " <yyy>456</yyy>\n" +
            " <aaa> \n\n" +
            " </aaa>\n" +
            "</root>";

    Pattern patternA = Pattern.compile("(\\s*)<(\\s*\\w+\\s*)>(\\1)</(\\2)>");
    Pattern patternB = Pattern.compile("(\\s*)<(\\s*\\w+\\s*)>\\s*</(\\2)>");
    Pattern patternC = Pattern.compile("\\s*<\\s*\\w+\\s*>\\s*</\\s*\\w+\\s*>");

    assertEquals(
            "<root>\n" +
            " \n" +
            " <yyy>456</yyy>\n" +
            " <aaa> \n" +
            "\n" +
            " </aaa>\n" +
            "</root>",
            patternA.matcher(xml).replaceAll("")
    );

    assertEquals(
            "<root>\n" +
            " <yyy>456</yyy>\n" +
            "</root>",
            patternB.matcher(xml).replaceAll("")
    );

    assertEquals(
            "<root>\n" +
            " <yyy>456</yyy>\n" +
            "</root>",
            patternC.matcher(xml).replaceAll("")
    );
}
I get the result I want with the regex "\\s*<\\s*\\w+\\s*>\\s*</\\s*\\w+\\s*>", but I don't understand why I can't do the same with "(\\s*)<(\\s*\\w+\\s*)>(\\1)</(\\2)>".
Please explain the difference in behavior between these regular expressions.
Answer 1
Score: 0
In regular expressions, \1 and \2 are called backreferences. A backreference matches the same text that the corresponding capturing group matched earlier; it does not re-apply the group's pattern. Backreferences let you write regular expressions that, for example, detect repeated letters and words.
For example, (\w+)\1 matches strings that consist of the same run of word characters repeated twice.
"banana".matches("(\\w+)\\1") // ==> false
"banabana".matches("(\\w+)\\1") // ==> true: bana is repeated
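In your regexp "(\\s*)<(\\s*\\w+\\s*)>(\\1)</(\\2)>" the backreference \1 does not mean "any whitespace again"; it requires the whitespace between the opening and closing tags to be exactly the same text as the whitespace captured before the opening tag. For <yyy></yyy> that is only possible when group 1 captures the empty string, so the whitespace before the tag is left in place. For <aaa> the whitespace between the tags never equals the whitespace in front of it, so the element is not matched at all.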
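As a rough illustration of the point above, here is a minimal sketch that contrasts the backreference version with a variant that repeats \s* between the tags. The class name BackreferenceDemo, both method names, and the short sample input are hypothetical and only serve the example; the second pattern is one possible rewrite under those assumptions, not the only way to do it.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative, not from the question.
public class BackreferenceDemo {

    // Pattern from the question: \1 must repeat the exact text captured by group 1.
    private static final Pattern WITH_BACKREFERENCE =
            Pattern.compile("(\\s*)<(\\s*\\w+\\s*)>(\\1)</(\\2)>");

    // Assumed alternative: a fresh \s* between the tags accepts any whitespace.
    // Only the tag name is captured, so \1 here checks that the closing tag
    // has the same name as the opening tag.
    private static final Pattern WITH_FRESH_WHITESPACE =
            Pattern.compile("\\s*<\\s*(\\w+)\\s*>\\s*</\\s*\\1\\s*>");

    public static String stripWithBackreference(String xml) {
        return WITH_BACKREFERENCE.matcher(xml).replaceAll("");
    }

    public static String stripWithFreshWhitespace(String xml) {
        return WITH_FRESH_WHITESPACE.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        // A backreference matches the captured text, not the group's pattern.
        System.out.println(Pattern.matches("(\\s*)x\\1", " x "));  // true
        System.out.println(Pattern.matches("(\\s*)x\\1", " x\n")); // false
        System.out.println(Pattern.matches("\\s*x\\s*", " x\n"));  // true

        String xml = "<r>\n <a></a>\n</r>";

        // Group 1 has to shrink to the empty string, so "\n " stays behind.
        System.out.println(stripWithBackreference(xml).equals("<r>\n \n</r>"));  // true

        // The leading whitespace is consumed together with the empty tag.
        System.out.println(stripWithFreshWhitespace(xml).equals("<r>\n</r>"));   // true
    }
}
```

In this sketch the backreference is kept only where repeating the exact text is the intent, namely the tag name; the whitespace is matched by pattern each time.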