What does a basic DOM XML parser need?

Question

I've started programming in Google's Go language, and the package I'm attempting to write is an API for processing and creating DOCX files (I'm familiar with this topic and thought it would be a good way to learn Go). As a DOCX file is primarily a ZIP file with various XML files inside it, I need a DOM XML parser. However, I was unable to find any native Go DOM XML parsers; the only ones I saw seemed to be very limited, and were probably SAX parsers (anyone who uses Go, correct me if I'm wrong).

So this past weekend I wrote a very basic DOM XML parser that was able to parse one of the simpler XML files within the DOCX package and output it back intact. At the moment I'm not going to bother with namespace, XSLT, or schema validation support, as those aren't useful for manipulating DOCX files. My question is, what other XML standards and functionality would be important to incorporate into the parser?

At the moment, it really just creates a tree of elements and attributes, which I can modify and save. I'm not currently handling CDATA elements or XML escape characters (though those would be easy to do and I'll get to that this weekend).

Answer 1

Score: 3

First of all: if you specifically want a DOM parser, you need to implement the DOM API. But I am not sure that is what you actually mean; perhaps you just mean an XML parser that produces an XML tree model (a "dom"), or just an XML parser? DOM is hardly the only way.
Also note that building a DOM tree model on top of a SAX parser is the most common approach; few if any DOM packages have embedded parsers, and the parser is commonly exposed separately.
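
To illustrate building a tree on top of an event-style parser, here is a minimal sketch in Go. It assumes the standard library's `encoding/xml` package and its `Decoder.Token` stream as the underlying parser; the `Node` type and the `BuildTree` and `ParseString` helpers are hypothetical names made up for this example, not part of any existing package.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// Node is a hypothetical tree node; its name and fields are illustrative only.
type Node struct {
	Name     xml.Name
	Attrs    []xml.Attr
	Text     string // character data found directly inside this element
	Children []*Node
}

// BuildTree reads tokens from the decoder and assembles them into a tree,
// using a stack of currently open elements.
func BuildTree(r io.Reader) (*Node, error) {
	dec := xml.NewDecoder(r)
	var root *Node
	var stack []*Node
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			// Copy the token because its data is only valid until the next Token call.
			n := &Node{Name: t.Name, Attrs: t.Copy().Attr}
			if len(stack) > 0 {
				parent := stack[len(stack)-1]
				parent.Children = append(parent.Children, n)
			} else if root == nil {
				root = n
			}
			stack = append(stack, n)
		case xml.EndElement:
			// Token guarantees matched start and end elements, so the stack is non-empty.
			stack = stack[:len(stack)-1]
		case xml.CharData:
			if len(stack) > 0 {
				stack[len(stack)-1].Text += string(t)
			}
		}
	}
	if root == nil {
		return nil, fmt.Errorf("no root element found")
	}
	return root, nil
}

// ParseString is a convenience wrapper for in-memory documents.
func ParseString(s string) (*Node, error) {
	return BuildTree(strings.NewReader(s))
}

func main() {
	root, err := ParseString(`<doc id="1"><p>Hello &amp; goodbye</p><p><![CDATA[a < b]]></p></doc>`)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(root.Name.Local, len(root.Children))
	for _, c := range root.Children {
		fmt.Printf("%s: %q\n", c.Name.Local, c.Text)
	}
}
```

The tree here is deliberately lossy (comments and processing instructions are dropped, and text is not kept in document order relative to child elements), so a real round-tripping model for DOCX would need more node kinds.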

As to XML parser features, some of the things that are MUSTs in my opinion are:

  • Handling of character references (ampersand and number, e.g. &#38;) and the pre-defined general entities (amp, lt, gt, apos, quot)
  • Handling of the xml declaration (<?xml ... ?>)
  • Handling of various input encodings, whether declared by the xml declaration or externally -- too many parsers skimp on this, but it is very important, since the encoding of an xml document can be reliably detected from the document itself.
  • Checking for uniqueness of attribute names within an element
  • Checking for proper nesting of elements
  • Skipping of comments
  • Skipping (if not handling) of processing instructions
  • CDATA handling -- it's simple to do
  • Keeping track of line numbers for error reporting
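
As a rough illustration of several items on this list, the sketch below walks a document with Go's `encoding/xml` decoder and shows where the xml declaration, comments, processing instructions, CDATA, entities, and error line numbers surface. The `Summary` type and the `Scan` and `ErrorLine` functions are hypothetical names invented for this example; input encodings and attribute uniqueness are not covered here.

```go
package main

import (
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"strings"
)

// Summary records what was seen in a document; a hypothetical type for illustration.
type Summary struct {
	Elements    int
	Comments    int
	ProcInsts   int
	Declaration string // contents of the <?xml ... ?> declaration, if present
	Text        string // all character data, with entities and CDATA already decoded
}

// Scan walks the token stream and classifies each token.
func Scan(doc string) (Summary, error) {
	var s Summary
	dec := xml.NewDecoder(strings.NewReader(doc))
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return s, nil
		}
		if err != nil {
			return s, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			s.Elements++
		case xml.CharData:
			// Entities, character references and CDATA sections all arrive as plain text.
			s.Text += string(t)
		case xml.Comment:
			s.Comments++ // counted here; a tree builder would usually skip these
		case xml.ProcInst:
			// The xml declaration is reported as a processing instruction with target "xml".
			if t.Target == "xml" {
				s.Declaration = string(t.Inst)
			} else {
				s.ProcInsts++
			}
		}
	}
}

// ErrorLine returns the line number carried by an xml.SyntaxError, or 0 otherwise.
func ErrorLine(err error) int {
	var se *xml.SyntaxError
	if errors.As(err, &se) {
		return se.Line
	}
	return 0
}

func main() {
	s, err := Scan(`<?xml version="1.0" encoding="UTF-8"?><!-- note --><?pi x?><r><![CDATA[1 < 2]]> &lt;&#65;</r>`)
	fmt.Printf("%+v %v\n", s, err)

	_, err = Scan("<a>\n<b>\n</a>")
	fmt.Println("error:", err, "line:", ErrorLine(err))
}
```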

Other things that will eventually be useful are:

  • Namespace handling
  • Checking of character validity, both content and names
  • Normalization of linefeeds as per the xml specification
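
The last two items are small enough to sketch directly. The following Go code is one possible reading of the XML 1.0 rules, assuming the `Char` production for content characters and the end-of-line rule that turns "\r\n" and a lone "\r" into "\n"; the function names `IsXMLChar` and `NormalizeLineEnds` are made up for this example, and name-character checking is left out.

```go
package main

import "fmt"

// IsXMLChar reports whether r is allowed by the Char production of XML 1.0:
// tab, LF, CR, and the ranges 0x20-0xD7FF, 0xE000-0xFFFD, 0x10000-0x10FFFF.
func IsXMLChar(r rune) bool {
	return r == 0x9 || r == 0xA || r == 0xD ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

// NormalizeLineEnds replaces every "\r\n" pair and every lone "\r" with "\n".
func NormalizeLineEnds(s string) string {
	out := make([]byte, 0, len(s))
	for i := 0; i < len(s); i++ {
		if s[i] == '\r' {
			out = append(out, '\n')
			// Swallow the "\n" of a "\r\n" pair so it is not emitted twice.
			if i+1 < len(s) && s[i+1] == '\n' {
				i++
			}
			continue
		}
		out = append(out, s[i])
	}
	return string(out)
}

func main() {
	fmt.Printf("%q\n", NormalizeLineEnds("a\r\nb\rc\nd"))
	fmt.Println(IsXMLChar('\t'), IsXMLChar(0x0), IsXMLChar(0xFFFE))
}
```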

Answer 2

Score: 1

Have you looked at Go's XML parser? http://golang.org/pkg/xml/

If it is missing functionality you need, it's probably still easier to add that than to roll your own.

huangapple
  • Posted on September 15, 2010, 08:09:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/3713811.html