September 7, 2020, 18:48:12

Facing org.xml.sax.SAXParseException in XML parsing

Question

I have written a scheduler in a Java Spring Boot application that runs once every hour. It worked fine for a month, but today it started throwing an exception while parsing. My guess is that the XML I am fetching is broken, or that it has changed slightly in a way I cannot pin down.

Please note: I cannot change the source data.

Here is my code:

    @Scheduled(fixedRate = 1*60*60*1000 , initialDelay = 10*1000)
    public String updateNewsFeed() {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            String URL = "https://nation.com.pk/rss/coronavirus";
            Document doc = db.parse(URL);
            List<NewsFeed> newsFeedList = parseNewsItemsToList(doc);
            return "Works fine";
        } catch (Exception ex) {
            return ex.getMessage();
        }
    }

    public List<NewsFeed> parseNewsItemsToList(Document doc) throws Exception {
        doc.getDocumentElement().normalize();
        NodeList nodes = doc.getElementsByTagName("item");
        List<NewsFeed> newsFeedList = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element element = (Element) nodes.item(i);
            NodeList title = element.getElementsByTagName("title");
            NodeList link = element.getElementsByTagName("link");
            NodeList description = element.getElementsByTagName("description");
            NodeList pubDate = element.getElementsByTagName("pubDate");
            NodeList guid = element.getElementsByTagName("guid");
            org.jsoup.nodes.Document htmlDoc = Jsoup.connect(link.item(0).getTextContent().trim()).get();
            /*Elements pngs = htmlDoc.select("picture");
            System.out.println("\nimg link:"+pngs.toString());*/
            String image = htmlDoc.select("picture").select("img[src~=(?i)\\.(png|jpe?g)]").attr("src").trim();
            newsFeedList.add(new NewsFeed(
                title.item(0).getTextContent().trim(),
                description.item(0).getTextContent().trim(),
                pubDate.item(0).getTextContent().trim(),
                guid.item(0).getTextContent().trim(),
                image,
                link.item(0).getTextContent().trim()
            ));
        }
        return newsFeedList;
    }

Here is the Error message:

    [Fatal Error] coronavirus:195:32: The entity name must immediately follow the '&' in the entity reference.
    org.xml.sax.SAXParseException; systemId: https://nation.com.pk/rss/coronavirus; lineNumber: 195; columnNumber: 32; The entity name must immediately follow the '&' in the entity reference.
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:258)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177)
        at com.i2p.covid19.service.NewsFeedService.updateNewsFeed(NewsFeedService.java:87)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84)
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Answer 1

Score: 0

The issue is the unescaped & (ampersand) character in the XML:
<category>Lifestyle & Entertainment</category>

A bare & is illegal in an XML document outside a CDATA section. It must be written as &amp;, but the producer of the XML document has not escaped the & character.
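
To see the rule in action, here is a minimal sketch that feeds three variants of the same element to the JDK's DOM parser. The class name AmpersandDemo and the helper parseText are made up for this illustration and are not part of the original code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class AmpersandDemo {

    // Hypothetical helper: returns the root element's text,
    // or null when the input is not well-formed XML.
    static String parseText(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            db.setErrorHandler(new DefaultHandler());
            return db.parse(new InputSource(new StringReader(xml)))
                     .getDocumentElement()
                     .getTextContent();
        } catch (SAXParseException ex) {
            return null; // the parser rejected the document
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // Bare ampersand: rejected by the parser.
        System.out.println(parseText("<category>Lifestyle & Entertainment</category>"));
        // Escaped ampersand: accepted.
        System.out.println(parseText("<category>Lifestyle &amp; Entertainment</category>"));
        // Ampersand inside a CDATA section: accepted.
        System.out.println(parseText("<category><![CDATA[Lifestyle & Entertainment]]></category>"));
    }
}
```

Only the first variant matches what the feed currently sends, which is why the scheduler fails at line 195, column 32.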

If you replace the bare & with &amp;, parsing will work.
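
Since the source data cannot be changed, the replacement has to happen on the client side, before the text reaches the parser. Below is a rough sketch of that idea, assuming the feed body has already been downloaded into a String. FeedSanitizer, escapeBareAmpersands and parseLenient are hypothetical names introduced here, and the regex is one possible heuristic rather than a complete XML repair tool.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class FeedSanitizer {

    // Matches an '&' that does NOT start a named (&amp;), decimal (&#38;)
    // or hexadecimal (&#x26;) reference.
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Hypothetical helper: escapes every bare ampersand in the raw feed text.
    static String escapeBareAmpersands(String xml) {
        return BARE_AMPERSAND.matcher(xml).replaceAll("&amp;");
    }

    // Hypothetical helper: sanitizes the text first, then hands it to the DOM parser.
    static Document parseLenient(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(escapeBareAmpersands(xml))));
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        String raw = "<category>Lifestyle & Entertainment</category>";
        Document doc = parseLenient(raw);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Two caveats with this approach: the pattern also rewrites a bare & inside a CDATA section, where it was legal to begin with, and it leaves named references that XML does not define (for example &nbsp;) untouched, so those would still fail to parse.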

Use the ROME library (https://rometools.github.io/rome/)
If your goal is to process RSS feeds, I recommend the ROME library, which handles special characters like & and is simple to use. See https://www.baeldung.com/rome-rss

The code snippet below prints International News, the content of the <title> tag of the RSS feed:

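    import java.net.URL;
    import com.rometools.rome.feed.synd.SyndFeed;
    import com.rometools.rome.io.SyndFeedInput;
    import com.rometools.rome.io.XmlReader;

    URL feedSource = new URL("https://nation.com.pk/rss/coronavirus");
    SyndFeedInput input = new SyndFeedInput();
    SyndFeed feed = input.build(new XmlReader(feedSource));
    System.out.println(feed.getTitle());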
