Writing an XML linter in PHP, but neither XMLReader nor XML Parser can continue after a parsing error

Question

I've been tasked with writing an XML linter in PHP 8 that will serve as a web API. The linter must work in verbose mode: it goes through the whole document and logs every error found (up to 1000 errors) together with its line number (yes, I know an XML document can be a single line, but this is a mandatory requirement).

In other words, I need an XML reader/parser module that can:

  1. [mandatory] process medium to large XML documents (100MB~1GB).
  2. [mandatory] skip past an error and keep parsing, if possible.
  3. [mandatory] let me plug in my own checker code to validate the values of TEXT nodes.
  4. [mandatory] report the line number of the current node.

But after some study, I found that none of the PHP built-in XML extensions satisfies all of these requirements.

For example, here is a "bad" XML document in which the closing tags at line 5 (<AuthorityCode>...</Authority>) and line 11 (<LastUpdateTime>...</LastUpdate>) do not match their opening tags:

<?xml version="1.0"?>
<FacilityList>
    <UpdateTime>2022-09-09T08:00:00+08:00</UpdateTime>
    <UpdateInterval type="SEMIAUTO">-1</UpdateInterval>
    <AuthorityCode>CA</Authority>
    <Facility>
        <FacilityID>NFB-NR-P00501-013037-SN-S9K6VPJ36-0002</FacilityID>
        <FacilityClass>01</FacilityClass>
        <FacilityType>003</FacilityType>
        <LocationType>1</LocationType>
        <LastUpdateTime>2022-10-04T13:00:00+08:00</LastUpdate>
    </Facility>
</FacilityList>

The xmllint tool from libxml reports both errors, at line 5 and at line 11, but XMLReader and XML Parser both stop at line 5 and go no further, and I can't find a way around that. Yes, I have already set the XML_PARSE_RECOVER flag (the literal 1 below) in XMLReader:

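// Buffer libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$parser = new XMLReader();
// The trailing 1 is the value of libxml2's XML_PARSE_RECOVER option.
$parser->open($filename, null, LIBXML_NOERROR | LIBXML_NOWARNING | 1);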

But it doesn't work (PHP 8.2.6): parsing still stops at the first error.
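
For reference, here is a minimal sketch of how the errors that XMLReader does report could be collected together with their line numbers through libxml_get_errors(). The function name collectReaderErrors() and the $maxErrors parameter are hypothetical and exist only for this illustration. The sketch does not make the reader continue past the first fatal error; it only shows where the reported errors and line numbers come from.

```php
<?php
// Sketch only: collectReaderErrors() and $maxErrors are hypothetical names.
// Streams a file with XMLReader and returns the libxml errors that were buffered.
function collectReaderErrors(string $filename, int $maxErrors = 1000): array
{
    // Buffer libxml errors so they can be read back with libxml_get_errors().
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $reader = new XMLReader();
    if (@$reader->open($filename)) {
        // read() returns false at the end of input or when parsing fails;
        // @ hides the PHP warning that accompanies a failed read().
        while (@$reader->read()) {
            // Custom checks on TEXT nodes would go here.
        }
        $reader->close();
    }

    $errors = [];
    foreach (libxml_get_errors() as $error) {
        if (count($errors) >= $maxErrors) {
            break; // honor the error cap
        }
        $errors[] = [
            'line'    => $error->line,
            'column'  => $error->column,
            'message' => trim($error->message),
        ];
    }

    // Restore the previous libxml error-handling state.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $errors;
}
```

Calling collectReaderErrors($filename) on the sample above would be expected to return the mismatch reported at line 5, and nothing for line 11, which is exactly the limitation described here.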

Did I do something wrong, or is it simply not possible to do what I want with the built-in XMLReader / XML expat parser?
DOMDocument can process the document and report both errors, but I don't want to load the whole 1GB of data into memory.
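
As an illustration of the DOMDocument behavior mentioned above, the following sketch parses a string in recovery mode and lists the buffered libxml errors. The function name collectDomErrors() is hypothetical. Note that it loads the entire document into memory, which is precisely what rules it out for 1GB inputs.

```php
<?php
// Sketch only: collectDomErrors() is a hypothetical helper name.
// Parses an XML string with DOMDocument in recovery mode and returns
// one "line N: message" string per libxml error.
function collectDomErrors(string $xml): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->recover = true;   // ask libxml to keep going after errors
    $doc->loadXML($xml);    // the whole document ends up in memory

    $errors = [];
    foreach (libxml_get_errors() as $error) {
        $errors[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $errors;
}
```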

[EDIT]
No, I'm not asking for a third-party product; I just want to know what I can do with PHP's built-in functions. For example, some sort of magic option in XMLReader / XML expat parser, or example code that makes DOMDocument parse partial data from a streaming source. Or at least just tell me "you can't do this in PHP".

I've already checked many third-party libraries, but none of them can do what I want. They either just wrap the XML expat parser or rely on DOMDocument to load everything into memory up front.

=====

BTW, is there any reliable way to get the line number from XMLReader? Yes, I know the XMLReader::expand() trick, but it just doesn't work when the XML is malformed (for example, a missing closing tag).
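
For readers unfamiliar with it, the expand() trick can be sketched roughly as follows: each element is expanded into a DOMNode, and DOMNode::getLineNo() is read from it. The function name elementLines() is hypothetical. This is only a sketch for well-formed input; expanding a node makes the reader parse that node's whole subtree, so expanding elements near the root can be expensive on large files, and on malformed input expand() may simply fail.

```php
<?php
// Sketch only: elementLines() is a hypothetical helper name.
// Returns [elementName, lineNumber] pairs for every element in the file.
function elementLines(string $filename): array
{
    $previous = libxml_use_internal_errors(true);
    $lines = [];

    $reader = new XMLReader();
    if (@$reader->open($filename)) {
        while (@$reader->read()) {
            if ($reader->nodeType !== XMLReader::ELEMENT) {
                continue; // only element start tags are of interest here
            }
            // expand() builds a DOM node for the current element (and its subtree).
            $node = @$reader->expand();
            if ($node !== false) {
                $lines[] = [$reader->name, $node->getLineNo()];
            }
        }
        $reader->close();
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $lines;
}
```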

Counting the \n and \r characters myself doesn't work either, because XMLReader doesn't report anything before <FacilityList>: the <?xml version="1.0"?> declaration and the whitespace that follows it are ignored entirely.

Answer 1

Score: 0

OK, judging from other people's comments, the answer to my question seems to be "No, you can't do that in PHP".

huangapple
  • Posted on May 30, 2023, 09:50:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76361152.html