Go: Decoding only one XML node at a time

Question

Looking through the source code of the encoding/xml package, all of the unmarshaling logic (which decodes the actual XML nodes and maps them onto typed values) lives in unmarshal, and the only way to invoke it is essentially by calling DecodeElement. However, the unmarshaling logic also inherently seeks out the next EndElement, apparently mainly for validation. This looks like a major design flaw to me: what if I have a massive XML file, I am sufficiently confident in its structure, and I just want to decode a single node at a time so that I can efficiently filter through the data on the fly? RawToken() can be used to get the current tag, which is great, but when you then call DecodeElement() on it, you get an error once the inevitable unmarshal() call starts running into nodes that it perceives as unbalanced.

It seems theoretically possible to encounter a token that I'd like to decode, capture the offset, decode the element, seek back to the previous position, and loop, but that would still result in a massive amount of unnecessary processing.

Is there no way to parse just one node at a time?

Answer 1

Score: 2

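What you describe is XML stream parsing, as done by any SAX parser. The good news is that encoding/xml supports it, although the feature is a bit hidden.

Create an xml.Decoder by passing an io.Reader to xml.NewDecoder. Then call Decoder.Token() to read the input stream up to the next valid XML token. From there, you can decide what to do next.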

Here is a small example that decodes only the chapter elements:

package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
)

const (
	book = `<?xml version="1.0" encoding="UTF-8"?>
<book>
  <preface>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</preface>
  <chapter num="1" title="Foo">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</chapter>
  <chapter num="2" title="Bar">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</chapter>
</book>`
)

type Chapter struct {
	Num     int    `xml:"num,attr"`
	Title   string `xml:"title,attr"`
	Content string `xml:",chardata"`
}

func main() {
	// Emulate a file or network stream.
	b := bytes.NewBufferString(book)

	// Set up a decoder on top of it.
	d := xml.NewDecoder(b)

	for {
		// Look for the next token. This only reads as far as the next
		// complete XML token in the stream.
		t, err := d.Token()
		if err == io.EOF {
			break // end of input
		}
		if err != nil {
			panic(err)
		}

		switch et := t.(type) {
		case xml.StartElement:
			// Check whether we are interested in this element;
			// if not, simply move on to the next token.
			if et.Name.Local == "chapter" {
				var c Chapter
				// Decode the element, which advances the stream up to the
				// matching end element. An error is returned if no matching
				// end element is found within the parent.
				if err := d.DecodeElement(&c, &et); err != nil {
					panic(err)
				}
				// We have found what we are interested in, so print it.
				fmt.Printf("%d: %s\n", c.Num, c.Title)
			} else if et.Name.Local == "book" {
				fmt.Println("Book begins!")
			}
		case xml.EndElement:
			if et.Name.Local != "book" {
				continue
			}
			fmt.Println("Finished processing book!")
		}
	}
}

This example is also available as a gist, and you can run it on the Go Playground.

huangapple
  • Posted on January 23, 2016 at 08:28:59
  • When reposting, please keep the link to this article: https://go.coder-hub.com/34958199.html