How would you parse XML in Java if the content of a tag contains > or <?

Question

Currently, I'm using XMLInputFactory and XMLEventReader to parse XML from an RSS data feed. The description element contains HTML tags, written with > and <. Java reads these as actual tags and treats them as the end of the description, so it cuts the text off and moves on to the next element. How can I exclude these tags from parsing?

Answer 1

Score: 0

I don't use the pull parser (XMLEventReader) much, but I believe that, as with the SAX parser, it can report a text node as a sequence of Characters events rather than as a single event, and it's up to the application to concatenate them. The parser is most likely to split the content at entity boundaries, to avoid bulk copying of character data when expanding entities.
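
A minimal sketch of that concatenation, assuming the feed escapes its HTML as entities (&lt; and &gt;) inside a description element. The class name DescriptionReader, the method readDescription, and the sample XML are made up for illustration. The sketch also shows the standard XMLInputFactory.IS_COALESCING property, which asks the parser to merge adjacent character data itself.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class DescriptionReader {

    // Reads the text of the first <description> element, joining every
    // Characters event the parser reports until that element closes.
    static String readDescription(String xml, boolean coalesce) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Optional: let the parser merge adjacent character data on its own.
        factory.setProperty(XMLInputFactory.IS_COALESCING, coalesce);
        XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
        StringBuilder text = new StringBuilder();
        boolean inDescription = false;
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && "description".equals(event.asStartElement().getName().getLocalPart())) {
                    inDescription = true;
                } else if (event.isEndElement()
                        && "description".equals(event.asEndElement().getName().getLocalPart())) {
                    break; // the element is complete, stop collecting
                } else if (inDescription && event.isCharacters()) {
                    // Append instead of overwrite: the text may arrive in pieces.
                    text.append(event.asCharacters().getData());
                }
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        // Hypothetical sample input with entity-escaped HTML in the description.
        String xml = "<item><description>&lt;b&gt;Hello&lt;/b&gt; world</description></item>";
        System.out.println(readDescription(xml, false));
        System.out.println(readDescription(xml, true));
    }
}
```

Appending to a StringBuilder works whether the parser delivers the text in one event or several, so the manual loop does not depend on the coalescing setting.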

Answer 2

Score: -1

You could temporarily replace every > and < in the content with a unique placeholder that you know does not otherwise occur. Then do your parsing, and replace the placeholders with > and < again when you are done, as in the following code.

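import java.util.regex.Matcher;
import java.util.regex.Pattern;

String original = "<container>>This< is a >test<</container>";
// Replace > and < only inside the element content. A plain original.replace(...)
// over the whole string would also replace the real <container> tags.
Matcher m = Pattern.compile("(<container>)(.*?)(</container>)", Pattern.DOTALL).matcher(original);
StringBuffer sb = new StringBuffer();
while (m.find()) {
    String inner = m.group(2).replace(">", "_TMP_CHARACTER_G_").replace("<", "_TMP_CHARACTER_L_");
    m.appendReplacement(sb, Matcher.quoteReplacement(m.group(1) + inner + m.group(3)));
}
m.appendTail(sb);
String newStr = sb.toString();
System.out.println(original + "\n" + newStr);
// Prints <container>>This< is a >test<</container>
// and <container>_TMP_CHARACTER_G_This_TMP_CHARACTER_L_ is a _TMP_CHARACTER_G_test_TMP_CHARACTER_L_</container>

// [Do your parsing here]

String theTagYouWant = newStr;
String theConvertedTag = theTagYouWant.replace("_TMP_CHARACTER_G_", ">").replace("_TMP_CHARACTER_L_", "<");
System.out.println(theConvertedTag);
// Prints the original String <container>>This< is a >test<</container>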
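
The "[Do your parsing here]" step could look roughly like the sketch below, which feeds the placeholder version of the string to XMLEventReader and restores > and < in the extracted text afterwards. The class name PlaceholderRoundTrip and the method parseAndRestore are hypothetical, and the sketch assumes the placeholder strings never occur in the real content.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class PlaceholderRoundTrip {
    static final String GT = "_TMP_CHARACTER_G_";
    static final String LT = "_TMP_CHARACTER_L_";

    // Parses XML whose element content already has > and < swapped for
    // placeholders, then restores the original characters in the text.
    static String parseAndRestore(String placeholderXml) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(placeholderXml));
        StringBuilder text = new StringBuilder();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isCharacters()) {
                    // Collect all character data; it may arrive in several events.
                    text.append(event.asCharacters().getData());
                }
            }
        } finally {
            reader.close();
        }
        // Swap the placeholders back only after parsing has finished.
        return text.toString().replace(GT, ">").replace(LT, "<");
    }

    public static void main(String[] args) throws XMLStreamException {
        String placeholderXml =
                "<container>" + GT + "This" + LT + " is a " + GT + "test" + LT + "</container>";
        System.out.println(parseAndRestore(placeholderXml)); // prints >This< is a >test<
    }
}
```

This only helps when the content can be told apart from the surrounding markup before parsing, which is the hard part with a real feed.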

huangapple
  • Posted on August 19, 2020, 03:39:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63475576.html