General XML parser in Go

Question

Is there a general approach to reading XML documents in Go? Something similar to XmlDocument or XDocument in C#?

All the examples I found show how to read XML by unmarshaling it into objects that I have to define, but that is quite time-consuming because I need to define a lot of stuff that I'm not going to use.

xml.Unmarshal(...)

Another approach is forward only reading using:

xml.NewDecoder(xmlFile) 

Described here: http://blog.davidsingleton.org/parsing-huge-xml-files-with-go/

Answer 1

Score: 6

> All the examples I found show how to read using unmarshaling functionality into the objects I need to define, but it's quite time consuming as I need to define a lot of stuff that I'm not going to use.

Then don't define what you're not going to use; define only what you are going to use. You don't have to create a Go model that perfectly covers the XML structure.

Let's assume you have an XML like this:

<blog id="1234">
	<meta keywords="xml,parsing,partial" />
	<name>Partial XML parsing</name>
	<url>http://somehost.com/xml-blog</url>
	<entries count="2">
		<entry time="2016-01-19 08:40:00">
			<author>Bob</author>
			<content>First entry</content>
		</entry>
		<entry time="2016-01-19 08:30:00">
			<author>Alice</author>
			<content>Second entry</content>
		</entry>
	</entries>
</blog>

And let's assume you only need the following info out of this XML:

  • id
  • keywords
  • blog name
  • author names

You can model these wanted pieces of information with the following struct:

type Data struct {
	Id   string `xml:"id,attr"`
	Meta struct {
		Keywords string `xml:"keywords,attr"`
	} `xml:"meta"`
	Name    string   `xml:"name"`
	Authors []string `xml:"entries>entry>author"`
}

And now you can parse only this information with the following code:

d := Data{}
if err := xml.Unmarshal([]byte(s), &d); err != nil {
	panic(err)
}
fmt.Printf("%+v", d)

Output:

{Id:1234 Meta:{Keywords:xml,parsing,partial} Name:Partial XML parsing Authors:[Bob Alice]}

Answer 2

Score: 3

Well, two things.

First, you are not obliged to define Go types that map to complex elements in order to parse XML with nothing but encoding/xml.
On the contrary, you can parse XML documents purely procedurally, calling xml.Unmarshal() only on primitive (non-nested) elements to parse them as values of "primitive" types (such as string, int32, or time.Time).

That would be a lot of code, for sure, but that's just approaching the same problem from a more dynamic angle. To understand what I mean, consider your fully parsed XML document in the form of a DOM object. To extract useful data from it, you have to query that object somehow or iterate over the tree. With the approach presented in the blog post you referred to, you traverse your XML document as you parse it, essentially combining parsing with querying/traversing.

This may or may not work for you, as the applicability of a particular approach to parsing XML-formatted data depends heavily on the data's structure and on what you intend to get out of parsing it. For instance, if you need to perform several queries over the document, with the later queries depending on the results of the earlier ones, the procedural decoding from that blog post will hardly work.
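To make the procedural approach concrete, here is a minimal sketch that uses only encoding/xml. It assumes the blog document from the first answer; the function name authors is made up for illustration. The decoder is advanced token by token, and only the leaf element of interest is decoded into a plain string:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// authors is a hypothetical helper: it walks the document token by token
// and collects the text of every <author> element, wherever it is nested.
func authors(doc string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var out []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil // end of input: done
		}
		if err != nil {
			return nil, err
		}
		start, ok := tok.(xml.StartElement)
		if !ok || start.Name.Local != "author" {
			continue // not an element we care about
		}
		// Decode just this leaf element into a "primitive" value.
		var name string
		if err := dec.DecodeElement(&name, &start); err != nil {
			return nil, err
		}
		out = append(out, name)
	}
}

func main() {
	const doc = `<blog id="1234">
	<entries count="2">
		<entry><author>Bob</author><content>First entry</content></entry>
		<entry><author>Alice</author><content>Second entry</content></entry>
	</entries>
</blog>`
	names, err := authors(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```

Since xml.NewDecoder accepts any io.Reader, the same loop should work on an open file instead of a string. Note that everything the loop skips is gone after the pass, which is why dependent follow-up queries do not fit this style well.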

Second, alternative libraries exist. For instance, look at xmltree and xmlpath.
While these two are written in pure Go, there exist a couple of packages wrapping libxml, for instance, goxml. With them, you can have DOM-oriented parsing if you like.
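If you would rather not pull in a library, a rough DOM-like tree can also be built with encoding/xml alone. The sketch below is not the API of xmltree, xmlpath, or goxml; the Node type and its Find and Attr helpers are hypothetical names introduced here. It relies on the ",any" and ",any,attr" struct tags, the latter of which needs Go 1.8 or newer:

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Node is a hypothetical generic element: it captures the name, all
// attributes, the direct character data, and all child elements.
type Node struct {
	XMLName  xml.Name
	Attrs    []xml.Attr `xml:",any,attr"`
	Text     string     `xml:",chardata"`
	Children []Node     `xml:",any"`
}

// Find returns the first direct child with the given local name, or nil.
func (n *Node) Find(name string) *Node {
	for i := range n.Children {
		if n.Children[i].XMLName.Local == name {
			return &n.Children[i]
		}
	}
	return nil
}

// Attr returns the value of the named attribute, or "" if it is absent.
func (n *Node) Attr(name string) string {
	for _, a := range n.Attrs {
		if a.Name.Local == name {
			return a.Value
		}
	}
	return ""
}

// parseTree unmarshals a whole document into a tree of Nodes.
func parseTree(doc string) (*Node, error) {
	var root Node
	if err := xml.Unmarshal([]byte(doc), &root); err != nil {
		return nil, err
	}
	return &root, nil
}

func main() {
	const doc = `<blog id="1234"><name>Partial XML parsing</name>` +
		`<entries count="2"><entry><author>Bob</author></entry>` +
		`<entry><author>Alice</author></entry></entries></blog>`
	root, err := parseTree(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(root.Attr("id"))
	if name := root.Find("name"); name != nil {
		fmt.Println(name.Text)
	}
	if entries := root.Find("entries"); entries != nil {
		for i := range entries.Children {
			if a := entries.Children[i].Find("author"); a != nil {
				fmt.Println(a.Text)
			}
		}
	}
}
```

The whole document is held in memory, so later queries can depend on earlier ones, at the cost of losing the streaming behaviour. For an element with both text and children, Text also collects the whitespace between the children, so it may need trimming.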

Yet another approach is to parse XML into a set of nested key/value maps using mxj.

huangapple
  • Posted on January 19, 2016, 00:01:40
  • When reposting, please keep the link to this article: https://go.coder-hub.com/34859030.html