Schema validation does not report all missing children

Question


Given this example schema ("big.xsd"):

<?xml version="1.0" encoding="UTF-8" ?>

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="root">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="A"/>
        <xsd:element name="B"/>
        <xsd:element name="C1" minOccurs="0"/>
        <xsd:element name="C2" minOccurs="0"/>
        <xsd:element name="C3" minOccurs="0"/>
        <xsd:element name="C4" minOccurs="0"/>
        <xsd:element name="C5" minOccurs="0"/>
        <xsd:element name="C6" minOccurs="0"/>
        <xsd:element name="C7" minOccurs="0"/>
        <xsd:element name="C8" minOccurs="0"/>
        <xsd:element name="C9" minOccurs="0"/>
        <xsd:element name="C10" minOccurs="0"/>
        <xsd:element name="C11" minOccurs="0"/>
        <xsd:element name="C12" minOccurs="0"/>
        <xsd:element name="C13" minOccurs="0"/>
        <xsd:element name="C14" minOccurs="0"/>
        <xsd:element name="C15" minOccurs="0"/>
        <xsd:element name="D"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

and this example document ("big.xml"):

<?xml version="1.0" ?>
<root>
  <A/>
  <B/>
</root>

Validating the document against the schema with lxml reports only the first ten "missing" children (line break inserted for readability):

>>> from lxml import etree
>>> schema_doc = etree.parse('big.xsd')
>>> schema = etree.XMLSchema(schema_doc)
>>>
>>> doc = etree.parse('big.xml')
>>> schema.assertValid(doc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/etree.pyx", line 3643, in lxml.etree._Validator.assertValid
lxml.etree.DocumentInvalid: Element 'root': Missing child element(s). 
Expected is one of ( C1, C2, C3, C4, C5, C6, C7, C8, C9, C10 )., line 2

This is consistent with xmllint's output (I believe lxml delegates the validation to libxml2) (line break inserted for readability):

$ xmllint --noout --schema big.xsd big.xml 
big.xml:2: element root: Schemas validity error : Element 'root': 
Missing child element(s). Expected is one of ( C1, C2, C3, C4, C5, C6, C7, C8, C9, C10 ).
big.xml fails to validate

Is there a way to make lxml report all the missing children, in particular the D element which is required to conform to the schema?


Notes

  • The actual schema is from a third party, so it cannot be changed.

  • Since the codebase I'm working with already depends on lxml I'm not asking for other packages (such as xmlschema) which might produce more useful error messages. I want to avoid adding more dependencies if possible.

Answer 1

Score: 2


> Is there a way to make lxml report all the missing children?

I don't know, but I think it's very unlikely that a schema processor would be customisable in this way.

I thought I would try this one on Saxon. It outputs:

> Validation error on line 5 column 8 of test.xml: FORG0001: In
> content of element <root>: The content is incomplete. It would be
> valid if followed by <Q{}D>.

Not a perfect error message (I wonder how widely users understand the notation <Q{}D>, for example) but it seems to capture what you are looking for.

Saxon goes to a lot of effort to analyse the situation. It gets to the end of the list of children, and finds that the state of the finite state machine is not a legitimate "final state". Rather than just reporting this blandly, it looks at all the possible transitions from this state to see if there is one that would lead to a legitimate final state, and finds that there is only one, namely a D element. On this particular occasion, that strategy works well. libxml2, by contrast, contents itself with listing the elements that could have occurred next, and truncating that list so it doesn't get ridiculously long.
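To make the difference between the two strategies concrete, here is a minimal Python sketch. It is not Saxon's or libxml2's actual code: it models only the flat sequence from big.xsd, assumes every particle occurs at most once, and the names PARTICLES, possible_next, advance, is_final and diagnose are all made up for illustration.

```python
# Hypothetical model of the <xsd:sequence> in big.xsd as (name, required) pairs.
PARTICLES = (
    [("A", True), ("B", True)]
    + [(f"C{i}", False) for i in range(1, 16)]
    + [("D", True)]
)


def possible_next(state):
    """Indexes of the particles that may legally occur next from `state`."""
    result = []
    for index in range(state, len(PARTICLES)):
        result.append(index)
        if PARTICLES[index][1]:
            # A required particle cannot be skipped, so nothing after it is reachable.
            break
    return result


def advance(state, name):
    """Return the state after consuming `name`, or None if it is not allowed here."""
    for index in possible_next(state):
        if PARTICLES[index][0] == name:
            return index + 1
    return None


def is_final(state):
    """A state is final when every remaining particle is optional."""
    return all(not required for _, required in PARTICLES[state:])


def diagnose(children):
    """Return (expected, completing) for an incomplete list of child names.

    `expected` is the full list of elements that could occur next (the kind of
    list libxml2 truncates); `completing` keeps only those that would lead
    directly to a final state (the Saxon-style hint described above).
    """
    state = 0
    for name in children:
        state = advance(state, name)
        if state is None:
            raise ValueError(f"unexpected element {name!r}")
    if is_final(state):
        return [], []
    candidates = possible_next(state)
    expected = [PARTICLES[i][0] for i in candidates]
    completing = [PARTICLES[i][0] for i in candidates if is_final(i + 1)]
    return expected, completing
```

For the children A, B this returns the untruncated expected list (C1 through C15, then D) together with D as the only completing element. A real validator has to handle choices, repetition, nested groups and wildcards, which is where this simple model stops being adequate.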

In general it's fairly easy for a validator to work out that the content is invalid; it's much harder to explain what's wrong. Explaining essentially means finding the minimum or most likely change to the document that would turn it from an invalid document into a valid one, and no strategy is going to do that successfully all of the time.
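For a content model as simple as the one in the question, one possible workaround that adds no dependency is to read the required children straight out of the schema document and compare them with the instance. The sketch below is an assumption-laden illustration, not a general validator: it handles only a flat xsd:sequence of local element declarations without namespaces, groups or references, and the helper names required_children and missing_required are hypothetical. It uses the standard library's xml.etree.ElementTree so that it runs anywhere; the calls used here (fromstring, findall with a namespace map, get, tag) are also available in lxml.etree.

```python
import xml.etree.ElementTree as ET  # lxml.etree offers the same calls used here

XSD_NS = {"xsd": "http://www.w3.org/2001/XMLSchema"}

# Abbreviated stand-ins for big.xsd and big.xml (assumption: same shape, fewer C elements).
SCHEMA = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="root">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="A"/>
        <xsd:element name="B"/>
        <xsd:element name="C1" minOccurs="0"/>
        <xsd:element name="D"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
"""
DOC = "<root><A/><B/></root>"


def required_children(schema_root, element_name):
    """Names of children declared with minOccurs >= 1 in a flat xsd:sequence."""
    path = (
        f"xsd:element[@name='{element_name}']"
        "/xsd:complexType/xsd:sequence/xsd:element"
    )
    return [
        decl.get("name")
        for decl in schema_root.findall(path, XSD_NS)
        if int(decl.get("minOccurs", "1")) > 0  # minOccurs defaults to 1 in XSD
    ]


def missing_required(schema_root, element):
    """Required child names (in schema order) that `element` does not contain."""
    present = {child.tag for child in element}
    return [
        name
        for name in required_children(schema_root, element.tag)
        if name not in present
    ]


missing = missing_required(ET.fromstring(SCHEMA), ET.fromstring(DOC))
print(missing)
```

This only supplements the validator's message after validation has failed; it does not replace schema validation, and it says nothing about ordering or about more complex content models.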

huangapple
  • Posted on February 24, 2023, 17:22:47
  • Source: https://go.coder-hub.com/75554740.html