Facing org.xml.sax.SAXParseException when parsing XML

Question

I wrote a scheduler in a Java Spring Boot application that runs once every hour. It worked fine for a month, but today it started throwing an exception while parsing. My guess is that the XML I am reading is broken, or that it has changed slightly in a way I cannot pin down.

Please note: I cannot change the source data.

Here is my code:

    @Scheduled(fixedRate = 1*60*60*1000, initialDelay = 10*1000)
    public String updateNewsFeed() {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            String URL = "https://nation.com.pk/rss/coronavirus";
            Document doc = db.parse(URL);
            List<NewsFeed> newsFeedList = parseNewsItemsToList(doc);
            return "Works fine";
        } catch (Exception ex) {
            return ex.getMessage();
        }
    }

    public List<NewsFeed> parseNewsItemsToList(Document doc) throws Exception {
        doc.getDocumentElement().normalize();
        NodeList nodes = doc.getElementsByTagName("item");
        List<NewsFeed> newsFeedList = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element element = (Element) nodes.item(i);
            NodeList title = element.getElementsByTagName("title");
            NodeList link = element.getElementsByTagName("link");
            NodeList description = element.getElementsByTagName("description");
            NodeList pubDate = element.getElementsByTagName("pubDate");
            NodeList guid = element.getElementsByTagName("guid");
            org.jsoup.nodes.Document htmlDoc = Jsoup.connect(link.item(0).getTextContent().trim()).get();
            /*Elements pngs = htmlDoc.select("picture");
            System.out.println("\nimg link:"+pngs.toString());*/
            String image = htmlDoc.select("picture").select("img[src~=(?i)\\.(png|jpe?g)]").attr("src").trim();
            newsFeedList.add(new NewsFeed(
                title.item(0).getTextContent().trim(),
                description.item(0).getTextContent().trim(),
                pubDate.item(0).getTextContent().trim(),
                guid.item(0).getTextContent().trim(),
                image,
                link.item(0).getTextContent().trim()
            ));
        }
        return newsFeedList;
    }

Here is the error message:

    [Fatal Error] coronavirus:195:32: The entity name must immediately follow the '&' in the entity reference.
    org.xml.sax.SAXParseException; systemId: https://nation.com.pk/rss/coronavirus; lineNumber: 195; columnNumber: 32; The entity name must immediately follow the '&' in the entity reference.
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:258)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177)
        at com.i2p.covid19.service.NewsFeedService.updateNewsFeed(NewsFeedService.java:87)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84)
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Answer 1

Score: 0

The issue is the unescaped ampersand (&) character in the XML:

    <category>Lifestyle & Entertainment</category>

A bare & is illegal in XML character data outside a CDATA section. It must be written as &amp;, but the producer of the XML document has not escaped the & character.
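
The rule can be checked in isolation with the JDK's own parser. The following is a minimal sketch, not part of the original code: the class name `AmpersandDemo` and the helper `isWellFormed` are made up for illustration, and the sample strings are modeled on the `<category>` element quoted above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, used only to illustrate the well-formedness rule.
class AmpersandDemo {

    /** Returns true if the string parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing "[Fatal Error] ..." to stderr.
            db.setErrorHandler(new DefaultHandler());
            db.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // SAXParseException (a SAXException) is what the scheduler in the question hits.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Bare ampersand: rejected by the parser.
        System.out.println(isWellFormed("<category>Lifestyle & Entertainment</category>")); // false
        // Escaped ampersand: accepted.
        System.out.println(isWellFormed("<category>Lifestyle &amp; Entertainment</category>")); // true
        // Ampersand inside a CDATA section: accepted.
        System.out.println(isWellFormed("<category><![CDATA[Lifestyle & Entertainment]]></category>")); // true
    }
}
```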

If you replace the bare & with &amp;, the document will parse.
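
Since the source data cannot be changed, one way to apply that replacement is on the client side, before the text reaches the parser. The sketch below is only an illustration under some assumptions: the class `FeedSanitizer`, its methods, and the regular expression are not from the original code, and the feed body is assumed to have been downloaded into a `String` already (for example via `URL.openStream()`), rather than passing the URL straight to `db.parse`.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; names and regex are assumptions for illustration.
class FeedSanitizer {

    // Matches '&' that does not start a named (&amp;), decimal (&#38;)
    // or hexadecimal (&#x26;) reference.
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    /** Escapes every bare ampersand and leaves existing references untouched. */
    static String escapeBareAmpersands(String xml) {
        return BARE_AMPERSAND.matcher(xml).replaceAll("&amp;");
    }

    /** Sanitizes the raw feed text, then parses it into a DOM document. */
    static Document parse(String rawXml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(escapeBareAmpersands(rawXml))));
    }

    public static void main(String[] args) throws Exception {
        String raw = "<rss><channel><item>"
                + "<category>Lifestyle & Entertainment</category>"
                + "</item></channel></rss>";
        Document doc = parse(raw);
        // prints "Lifestyle & Entertainment"
        System.out.println(doc.getElementsByTagName("category").item(0).getTextContent());
    }
}
```

This is a blunt, text-level fix with known limits: it also rewrites ampersands inside CDATA sections (where they were already legal), and it leaves undeclared named entities such as `&nbsp;` alone, so those would still fail to parse.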

Use the ROME library (https://rometools.github.io/rome/)
If your goal is to process RSS feeds, I recommend the ROME library, which handles special characters such as &. It is straightforward and simple to use. See https://www.baeldung.com/rome-rss

The code snippet below prints the feed title, International News, taken from the <title> tag of the RSS feed:

    URL feedSource = new URL("https://nation.com.pk/rss/coronavirus");
    SyndFeedInput input = new SyndFeedInput();
    SyndFeed feed = input.build(new XmlReader(feedSource));
    System.out.println(feed.getTitle());
