Different behavior of regular expressions in java with and without using groups

Question

I am experimenting with regular expressions in Java, in particular with groups. I am trying to strip empty tags from a string containing XML. Without groups everything works, but as soon as I define a regex that uses groups, I get behavior I don't understand. I expect the behavior shown in the last assertion in the code below:

    @Test
    public void testRegexpGroups() {
        String xml =
            "<root>\n" +
                "    <yyy></yyy>\n" +
                "    <yyy>456</yyy>\n" +
                "    <aaa>  \n\n" +
                "    </aaa>\n" +
                "</root>";
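        // A: \1 requires the text between the tags to repeat the whitespace captured before the tag
        Pattern patternA = Pattern.compile("(\\s*)<(\\s*\\w+\\s*)>(\\1)</(\\2)>");
        // B: like A, but any whitespace is allowed between the tags
        Pattern patternB = Pattern.compile("(\\s*)<(\\s*\\w+\\s*)>\\s*</(\\2)>");
        // C: no groups at all; the closing tag name is not tied to the opening one
        Pattern patternC = Pattern.compile("\\s*<\\s*\\w+\\s*>\\s*</\\s*\\w+\\s*>");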


        assertEquals(
            "<root>\n" +
            "    \n" +
            "    <yyy>456</yyy>\n" +
            "    <aaa>  \n" +
            "\n" +
            "    </aaa>\n" +
            "</root>",
            patternA.matcher(xml).replaceAll("")
        );

        assertEquals(
            "<root>\n" +
                "    <yyy>456</yyy>\n" +
                "</root>",
            patternB.matcher(xml).replaceAll("")
        );

        assertEquals(
            "<root>\n" +
                "    <yyy>456</yyy>\n" +
                "</root>",
            patternC.matcher(xml).replaceAll("")
        );
    }

I get the expected result with the regex "\\s*<\\s*\\w+\\s*>\\s*</\\s*\\w+\\s*>", but I don't understand why I can't do the same with "(\\s*)<(\\s*\\w+\\s*)>(\\1)</(\\2)>".
Please explain the difference in behavior between these regular expressions.

Answer 1

Score: 0

In regular expressions, \1 and \2 are called backreferences. A backreference matches the same text that the corresponding capturing group matched earlier. This lets you write regular expressions that, for example, detect duplicated letters and words.

For example, (\w+)\1 matches a string that consists of the same run of word characters repeated twice.

    "banana".matches("(\\w+)\\1")   // ==> false
    "banabana".matches("(\\w+)\\1") // ==> true: "bana" is repeated
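
As a further illustration of detecting duplicated words, here is a minimal sketch. The class name `RepeatedWordDemo`, the method `firstRepeatedWord`, and the sample sentences are made up for this example and are not part of the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedWordDemo {
    // A whole word, some whitespace, then the same word again (\1).
    // The \b anchors keep the match from starting or ending inside a longer word.
    private static final Pattern REPEATED_WORD =
        Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    /** Returns the first word that is immediately repeated, or null if there is none. */
    static String firstRepeatedWord(String text) {
        Matcher m = REPEATED_WORD.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("this is is a test"));    // prints: is
        System.out.println(firstRepeatedWord("nothing doubled here")); // prints: null
    }
}
```

Unlike `String.matches`, which must match the whole string, `Matcher.find` looks for the pattern anywhere in the input, which is why the sketch uses it.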

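In your regexp "(\\s*)<(\\s*\\w+\\s*)>(\\1)</(\\2)>", \1 must repeat exactly the text captured by group 1, so you require the whitespace between the opening and closing tags to be identical to the whitespace before the opening tag. That is why <yyy></yyy> only matches when group 1 is empty, which leaves the indentation in front of it untouched, and why the <aaa> element, whose inner whitespace is not a copy of the whitespace before it, does not match at all. Likewise, \2 requires the closing tag name to repeat the opening one exactly.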
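
The effect can be checked with a small sketch that reuses the XML string and pattern A from the question. The class name `EmptyTagDemo`, the helper `strip`, and the `NAME_ONLY` pattern are assumptions introduced for illustration: `NAME_ONLY` is one possible variant that keeps a backreference only for the tag name and leaves the whitespace unconstrained. It is not something the original answer proposes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyTagDemo {
    // Same input as in the question.
    static final String XML =
        "<root>\n" +
        "    <yyy></yyy>\n" +
        "    <yyy>456</yyy>\n" +
        "    <aaa>  \n\n" +
        "    </aaa>\n" +
        "</root>";

    // Pattern A from the question: \1 ties the inner whitespace to the leading whitespace.
    static final Pattern PATTERN_A =
        Pattern.compile("(\\s*)<(\\s*\\w+\\s*)>(\\1)</(\\2)>");

    // Hypothetical variant: only the tag name is back-referenced.
    static final Pattern NAME_ONLY =
        Pattern.compile("\\s*<\\s*(\\w+)\\s*>\\s*</\\s*\\1\\s*>");

    /** Removes every match of the given pattern from the input. */
    static String strip(Pattern pattern, String xml) {
        return pattern.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        // Show what pattern A matches and how much whitespace group 1 captured.
        Matcher m = PATTERN_A.matcher(XML);
        while (m.find()) {
            System.out.println("A matched [" + m.group() + "], group 1 length = "
                + m.group(1).length());
        }
        // Result with the name-only variant: both empty elements are removed.
        System.out.println(strip(NAME_ONLY, XML));
    }
}
```

With this input, pattern A finds only `<yyy></yyy>` with an empty group 1, while the name-only variant removes both empty elements together with their leading whitespace. For real XML processing a parser is usually more robust than a regex; the sketch only illustrates how the backreferences behave.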

Posted by huangapple on August 19, 2020 at 18:49:53
Original link: https://go.coder-hub.com/63485347.html