Regex - Unformat XML
Question
I am trying to unformat an XML document into a single line (using Java).
I tried the following regex replacement:
input.replaceAll(">\\s+", ">").replaceAll("\\s+<", "<");
However, it also removes the whitespace at the start and end of an element's text content, which is not what I want.
For example:
Scenario 01
Before: <AAA>{space}{space}{space}</AAA>
After: <AAA></AAA>
Scenario 02
Before: <AAA>{space}{space}123{space}{space}</AAA>
After: <AAA>123</AAA>
Scenario 03
Before: <AAA>{space}A{space}B{space}C{space}</AAA>
After: <AAA>A{space}B{space}C</AAA>
Is there any way to unformat the XML while avoiding the scenarios above?
Answer 1
Score: 1
A Saxon solution:
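// All classes below come from Saxon's s9api package (net.sf.saxon.s9api).
Processor p = new Processor(false);
DocumentBuilder db = p.newDocumentBuilder();
// Drop all whitespace-only text nodes while building the tree.
db.setWhitespaceStrippingPolicy(WhitespaceStrippingPolicy.ALL);
XdmNode doc = db.build(new File(...));
Serializer s = p.newSerializer(new File(...));
s.serialize(doc.asSource());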
You can control the output format in some detail by setting properties on the Serializer object.
Answer 2
Score: 0
This removes only vertical whitespace (line breaks such as "\n", "\r", or combinations of them, plus a few other characters) that directly follows the end of a tag or directly precedes the start of a tag.
input.replaceAll(">\\v+", ">").replaceAll("\\v+<", "<");
An excerpt from https://www.regular-expressions.info/shorthand.html:
> \v matches “vertical whitespace”, which includes all characters treated as line breaks in the Unicode standard. It is the same as [\n\cK\f\r\x85\x{2028}\x{2029}].
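One thing to be aware of: the \v replacement removes the line breaks but leaves any indentation that follows them. The sketch below compares it with a variant that treats whitespace between two tags as formatting only when it contains at least one line break. That rule, the class name UnformatXml, and both method names are assumptions made for illustration and are not part of the answer above. It relies on Java 8 or later, where \v matches vertical whitespace.

```java
import java.util.regex.Pattern;

final class UnformatXml {
    // Assumption (not from the answer): whitespace between two tags counts as
    // formatting only if it contains at least one line break (\v).
    private static final Pattern BREAK_BETWEEN_TAGS =
            Pattern.compile(">\\s*\\v\\s*<");

    // The two-step replacement from the answer, unchanged.
    static String stripLineBreaks(String xml) {
        return xml.replaceAll(">\\v+", ">").replaceAll("\\v+<", "<");
    }

    // Hypothetical variant: also removes the indentation around a line break,
    // but leaves same-line whitespace such as <AAA>   </AAA> alone.
    static String unformat(String xml) {
        return BREAK_BETWEEN_TAGS.matcher(xml).replaceAll("><");
    }

    public static void main(String[] args) {
        String xml = "<root>\n  <AAA>   </AAA>\n  <BBB> A B C </BBB>\n</root>";
        System.out.println(stripLineBreaks(xml)); // indentation spaces remain
        System.out.println(unformat(xml));        // single line, element text intact
    }
}
```

Both methods work on the raw text, so they also touch CDATA sections and comments, and the variant collapses whitespace-only content that spans a line break. If that matters, a parser-based approach is the safer choice.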