Regex - Unformat XML

Question

I am trying to unformat an XML document into a single line (using Java).

I tried the following regex replacement:

input.replaceAll(">\\s+", ">").replaceAll("\\s+<", "<");

However, this also removes the whitespace at the start and end of each element's text content, which is not what I want.

For example:

Scenario 01

Before: <AAA>{space}{space}{space}</AAA>

After: <AAA></AAA>

Scenario 02

Before: <AAA>{space}{space}123{space}{space}</AAA>

After: <AAA>123</AAA>

Scenario 03

Before: <AAA>{space}A{space}B{space}C{space}</AAA>

After: <AAA>A{space}B{space}C</AAA>

Is there any way to unformat the XML while avoiding the scenarios above?

Answer 1

Score: 1

A Saxon solution:

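// Requires Saxon on the classpath: import net.sf.saxon.s9api.*; import java.io.File;
Processor p = new Processor(false);
DocumentBuilder db = p.newDocumentBuilder();
// Strip whitespace-only text nodes while the document is parsed
db.setWhitespaceStrippingPolicy(WhitespaceStrippingPolicy.ALL);
XdmNode doc = db.build(new File(...));
Serializer s = p.newSerializer(new File(...));
s.serialize(doc.asSource());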

You can control the output format in more detail by setting properties on the Serializer object.
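If adding Saxon is not an option, the same parse-then-serialize idea can be approximated with the JDK's built-in DOM and Transformer APIs. The sketch below is not the Saxon API. The class name DomUnformat and its methods are hypothetical, and it assumes that whitespace-only text is formatting only when it sits inside an element that also has child elements, so an element like the one in Scenario 01 keeps its spaces.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: parse, drop formatting-only text nodes, serialize without indentation.
class DomUnformat {

    static String unformat(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            stripFormatting(doc.getDocumentElement());

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.INDENT, "no");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process XML", e);
        }
    }

    // Assumption: whitespace-only text is removed only from elements that contain child elements.
    private static void stripFormatting(Node element) {
        boolean hasChildElement = false;
        for (Node c = element.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
            }
        }
        Node c = element.getFirstChild();
        while (c != null) {
            Node next = c.getNextSibling(); // remember the sibling before a possible removal
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                stripFormatting(c);
            } else if (hasChildElement
                    && c.getNodeType() == Node.TEXT_NODE
                    && c.getNodeValue().trim().isEmpty()) {
                element.removeChild(c);
            }
            c = next;
        }
    }

    public static void main(String[] args) {
        String input = "<root>\n  <AAA>   </AAA>\n  <BBB> A B C </BBB>\n</root>";
        System.out.println(unformat(input));
    }
}
```

Note that this rule would also drop a significant single space between two inline elements in mixed content, so it only suits data-style XML where text and child elements do not share a parent.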

Answer 2

Score: 0

This removes only vertical whitespace (line breaks such as "\n", "\r", or "\r\n") that directly follows the end of a tag or directly precedes the start of one. Spaces and tabs, including indentation, are left in place.

input.replaceAll(">\\v+", ">").replaceAll("\\v+<", "<");

An excerpt from https://www.regular-expressions.info/shorthand.html:

> \v matches “vertical whitespace”, which includes all characters treated as line breaks in the Unicode standard. It is the same as [\n\cK\f\r\x85\x{2028}\x{2029}].
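Building on the \v shorthand, here is a hedged variant that also removes indentation. It is a sketch that assumes formatting whitespace between two tags always contains at least one line break, while significant whitespace inside an element (as in Scenarios 01 to 03) does not. The class name XmlLineBreakStripper and the pattern are illustrative and not part of the answer above.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes whitespace runs between tags only when they contain a line break.
class XmlLineBreakStripper {

    // (?<=>) requires a preceding '>', (?=<) requires a following '<';
    // \s*\v\s* is a whitespace run that contains at least one vertical whitespace character.
    private static final Pattern BETWEEN_TAGS =
            Pattern.compile("(?<=>)\\s*\\v\\s*(?=<)");

    static String unformat(String xml) {
        return BETWEEN_TAGS.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "<root>\n  <AAA>   </AAA>\n  <BBB> A B C </BBB>\n</root>";
        // prints <root><AAA>   </AAA><BBB> A B C </BBB></root>
        System.out.println(unformat(input));
    }
}
```

Like any regex approach, this is not XML-aware: it can misfire on a literal '>' in text, on CDATA sections, or on comments, and an element whose only content is a line break would be emptied.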

huangapple
  • Published on October 1, 2020 at 14:47:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64150323.html