StAX: START_DOCUMENT on empty XML file

Question

I am trying to understand the StAX design with regard to the START_DOCUMENT event. The typical while loop is:

XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
try {
  XMLEventReader xmlEventReader = xmlInputFactory.createXMLEventReader(new FileInputStream(fileName));
  while(xmlEventReader.hasNext()) {
    XMLEvent xmlEvent = xmlEventReader.nextEvent();
    switch( xmlEvent.getEventType() ) {
	[...]

With this loop there is no way to distinguish between an empty XML file and an XML file that contains only the XML declaration. For example:

% test -s empty.xml || echo empty      
empty
% cat start.xml 
<?xml version="1.0" encoding="UTF-8"?>

The two files above produce exactly the same series of StAX events (a single START_DOCUMENT). Is this behavior documented somewhere? Why would anyone want a START_DOCUMENT event for an empty file?

Answer 1

Score: 1

If you're parsing a file and the file doesn't contain well-formed XML, then the only thing you can be sure of is that an error will be reported. Neither of the two cases you describe (an empty file, and a file containing only an XML declaration) is well-formed, so you can't rely on anything except the error.

Having said that, if I recall correctly there are differences between StAX parsers in the sequence of events they report, even in cases that are well-formed. It's worth testing your code with more than one.
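Before comparing parsers, it helps to know which StAX implementation is actually being loaded. A minimal sketch, assuming the default lookup performed by XMLInputFactory.newInstance(); the class name StaxImplementationCheck is made up for illustration:

```java
import javax.xml.stream.XMLInputFactory;

class StaxImplementationCheck {
    /** Returns the fully qualified class name of the XMLInputFactory in use. */
    static String implementationName() {
        // newInstance() picks the implementation through the standard lookup
        // (system property, service provider on the classpath, or the JDK
        // default), so the result depends on the runtime environment.
        return XMLInputFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("StAX implementation: " + implementationName());
    }
}
```

Running the same event loop once with the JDK's built-in parser and once with another implementation on the classpath (Woodstox, for example) should show whether the event sequence your code relies on is portable.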

Answer 2

Score: 1

Both files are equally parseable, as the XML declaration is optional.

Neither one is well-formed (because a well-formed XML document must have a root element), but especially from the perspective of an event parser like StAX, these are the same thing.

After the START_DOCUMENT event, the next hasNext call should throw an XMLStreamException indicating that the document isn't well-formed.
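A minimal sketch of a loop that makes the error visible instead of relying on the event sequence. The names StaxProbe, Result, probe and isUsableDocument are hypothetical helpers, not part of StAX. Note that XMLEventReader.hasNext() does not declare XMLStreamException, so in practice the exception may surface from nextEvent() instead; which call raises it is an assumption that can vary by implementation.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class StaxProbe {
    /** Outcome of one parse attempt: the event types seen, plus the error, if any. */
    static final class Result {
        final List<Integer> eventTypes = new ArrayList<>();
        XMLStreamException error; // stays null when no parse error was reported

        /** True only if parsing finished without error and a root element was seen. */
        boolean isUsableDocument() {
            return error == null
                    && eventTypes.contains(XMLStreamConstants.START_ELEMENT);
        }
    }

    /** Runs the usual event loop over the given text and records what happened. */
    static Result probe(String xml) {
        Result result = new Result();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLEventReader reader = null;
        try {
            reader = factory.createXMLEventReader(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                result.eventTypes.add(event.getEventType());
            }
        } catch (XMLStreamException e) {
            // An empty input or a declaration-only input is expected to end up
            // here, since neither contains a root element.
            result.error = e;
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (XMLStreamException ignored) {
                    // Nothing useful to do if closing fails in this sketch.
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "",
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
            "<a/>"
        };
        for (String input : inputs) {
            Result r = probe(input);
            System.out.println("events=" + r.eventTypes
                    + " usable=" + r.isUsableDocument()
                    + " error=" + (r.error == null ? "none" : r.error.getMessage()));
        }
    }
}
```

The point of the design is to treat the exception, not the START_DOCUMENT event, as the signal: both problem files are rejected the same way, and only an input with a root element counts as usable.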

huangapple
  • Posted on September 10, 2020, 15:56:56
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63825248.html