How would you parse XML in Java if the content of a tag contains > or <?
Question
Currently, I'm using XMLInputFactory and XMLEventReader to parse XML from an RSS data feed. The description element contains HTML tags written with &gt; and &lt;. Java seems to read these as actual tags and treats them as the end of the description, so the text is cut off and parsing moves on to the next element. How can I keep these tags from being parsed?
Answer 1
Score: 0
I don't use the pull parser (XMLEventReader) much, but I believe that, as with the SAX parser, it can report a single text node as a sequence of Characters events rather than as one event, and it's up to the application to concatenate them. The parser is most likely to split the content at entity boundaries, which lets it avoid bulk copying of character data when expanding entities.
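To illustrate the concatenation described above, here is a minimal sketch rather than a definitive implementation. It assumes the element of interest is named description; the class name DescriptionReader and the method readDescription are made up for this example. It appends every Characters event seen inside the element, and also sets the standard IS_COALESCING property, which asks the parser to merge adjacent character data itself.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper for illustration; the names are not from the original post.
final class DescriptionReader {

    // Returns the full text of the first <description> element, or "" if there is none.
    static String readDescription(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser to merge adjacent character data into a single event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        StringBuilder text = new StringBuilder();
        try {
            XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
            boolean inDescription = false;
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && "description".equals(event.asStartElement().getName().getLocalPart())) {
                    inDescription = true;
                } else if (event.isEndElement()
                        && "description".equals(event.asEndElement().getName().getLocalPart())) {
                    break; // stop after the first description element
                } else if (inDescription && event.isCharacters()) {
                    // Append every chunk, so the result is complete even if
                    // the parser delivers the text in several events.
                    text.append(event.asCharacters().getData());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xml = "<item><description>&lt;p&gt;Hello &amp; bye&lt;/p&gt;</description></item>";
        System.out.println(readDescription(xml)); // prints <p>Hello & bye</p>
    }
}
```

Appending in the loop is enough on its own. The coalescing property is a convenience for code that wants to read a single event per text node.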
Answer 2
Score: -1
You could temporarily replace every &gt; and &lt; entity with a unique placeholder of your own, do your parsing, and then replace the placeholders with &gt; and &lt; again once parsing is done, as in the following code.
String original = "<container>&gt;This&lt; is a &gt;test&lt;</container>";
// Swap each entity for a placeholder that does not occur anywhere else in the input.
String newStr = original.replace("&gt;", "_TMP_CHARACTER_G_").replace("&lt;", "_TMP_CHARACTER_L_");
System.out.println(original + "\n" + newStr);
// Prints <container>&gt;This&lt; is a &gt;test&lt;</container>
// and <container>_TMP_CHARACTER_G_This_TMP_CHARACTER_L_ is a _TMP_CHARACTER_G_test_TMP_CHARACTER_L_</container>
// [Do your parsing here]
String theTagYouWant = newStr; // stands in for the text your parser extracted
// Restore the entities once parsing is done.
String theConvertedTag = theTagYouWant.replace("_TMP_CHARACTER_G_", "&gt;").replace("_TMP_CHARACTER_L_", "&lt;");
System.out.println(theConvertedTag);
// Prints the original string <container>&gt;This&lt; is a &gt;test&lt;</container>