Regex to find XML tag in multiline string

Question

Here is a simple function I wrote to get the value from a tag.

public static String getTagAValue(String xmlAsString) {
    Pattern pattern = Pattern.compile("<TagA>(.+)</TagA>");
    Matcher matcher = pattern.matcher(xmlAsString);
    if (matcher.find()) {
        return matcher.group(1);
    } else {
        return null;
    }
}

It does not find a match and returns null.

XML Sample

<xml>
    <sample>
        <TagA>result</TagA>
    </sample>
</xml>

Note, here I used 4 spaces for tabs, but the real string would contain tabs.

Answer 1

Score: 3

Don't use regular expressions to parse XML: it's the wrong tool for the job.

Classic answer here: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

The answer you have accepted gives wrong results, for example:

  • It doesn't accept whitespace in places where whitespace is allowed, such as before ">".

  • It will match a commented-out element, or one that appears in a CDATA section.

  • It does a greedy match, so it will find the LAST matching end tag, not the first one.
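
To make the three points above concrete, here is a minimal sketch that runs the accepted pattern (with Pattern.DOTALL) against three small inputs. The class name RegexPitfalls, the helper firstGroup, and the input strings are made up for illustration; they are not from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the inputs are illustrative only.
final class RegexPitfalls {

    // The accepted pattern, compiled with DOTALL so '.' also matches line terminators.
    static final Pattern GREEDY = Pattern.compile("<TagA>(.+)</TagA>", Pattern.DOTALL);

    // Returns capture group 1 of the first match, or null when nothing matches.
    static String firstGroup(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Greedy match: the group runs up to the LAST </TagA>.
        // prints: one</TagA><TagA>two
        System.out.println(firstGroup(GREEDY, "<TagA>one</TagA><TagA>two</TagA>"));

        // The commented-out element is matched as if it were live markup.
        // prints: old</TagA> --><TagA>new
        System.out.println(firstGroup(GREEDY, "<!-- <TagA>old</TagA> --><TagA>new</TagA>"));

        // Whitespace before '>' is legal XML, but the literal pattern rejects it.
        // prints: null
        System.out.println(firstGroup(GREEDY, "<TagA >value</TagA >"));
    }
}
```

A reluctant quantifier such as (.+?) would address only the first case; the other two remain wrong.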

However hard you try, you will never get it 100% right.

And in case you care more about performance than correctness, it's also grossly inefficient because of the need for backtracking.

To do the job properly and professionally, use an XML parser.
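
As one possible way to do that, here is a sketch using the DOM parser that ships with the JDK (javax.xml.parsers). The class name XmlTagReader is hypothetical, and wrapping parse failures in IllegalArgumentException is just a choice made for this example. It keeps the question's method name and returns the text of the first TagA element, or null when there is none.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; only getTagAValue mirrors the name used in the question.
final class XmlTagReader {

    // Parses the string as XML and returns the text of the first <TagA> element, or null.
    static String getTagAValue(String xmlAsString) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xmlAsString)));
            NodeList nodes = document.getElementsByTagName("TagA");
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Assumption for this sketch: malformed input is reported as an unchecked exception.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<xml>\n\t<sample>\n\t\t<TagA>result</TagA>\n\t</sample>\n</xml>";
        System.out.println(getTagAValue(xml)); // prints: result
    }
}
```

Because the parser understands comments, CDATA sections, and whitespace inside tags, the cases that trip up the regex are handled without extra work. If the input comes from an untrusted source, the parser should also be hardened against external entities, which this sketch does not do.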

Answer 2

Score: 2

You probably want the regular expression to match across multiple lines:

Pattern.compile("<TagA>(.+)</TagA>", Pattern.DOTALL);

Documentation explains the parameter Pattern.DOTALL:

> Enables dotall mode. In dotall mode, the expression . matches any
> character, including a line terminator. By default this expression
> does not match line terminators.
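
To illustrate the difference, here is a small sketch that runs the same pattern with and without the flag on a value that spans two lines. The class name DotallDemo, the helper find, and the sample string are assumptions made for this example, not taken from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and the sample input are illustrative only.
final class DotallDemo {

    // Compiles the question's pattern with the given flags and returns group 1, or null.
    static String find(String input, int flags) {
        Matcher matcher = Pattern.compile("<TagA>(.+)</TagA>", flags).matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // The tag's value contains a line break, so '.' must cross a line terminator.
        String xml = "<xml>\n\t<TagA>first line\nsecond line</TagA>\n</xml>";

        // Default mode: '.' stops at the newline, so there is no match.
        System.out.println(find(xml, 0) == null); // prints: true

        // DOTALL mode: '.' matches the newline too, so both lines are captured.
        System.out.println(find(xml, Pattern.DOTALL).equals("first line\nsecond line")); // prints: true
    }
}
```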

Edit: While this works in this particular case, please refer to Michael Kay's answer if you want to solve such problems professionally, efficiently, and correctly.

Posted by huangapple on September 11, 2020 at 01:47:07
Source: https://go.coder-hub.com/63835098.html
