How to parse a huge XML file with various elements in Go?

Question

How can you parse a huge XML file that contains various elements (i.e. not the same element repeated multiple times)?

Example:

<stuff>
    <header>...</header>
    <item>...</item>
    ...
    <item>...</item>
    <something>...</something>
</stuff>

I want to write a script in Go that splits this file into multiple smaller files, each with a specific number of tags.
All the examples of parsing XML with Go seem to rely on knowing which elements the file contains.

Can the file be parsed without knowing that? Something like iterating over each element in the XML, no matter what element it is (header, item, something, etc.)?

Answer 1

Score: 22

Use the standard library's xml.Decoder (package encoding/xml).

Call Token to read tokens one by one. When a start element of interest is found, call DecodeElement to decode the element to a Go value.

Here's a sketch of how to use the decoder:

d := xml.NewDecoder(r) // r is any io.Reader, e.g. an open file
for {
	t, tokenErr := d.Token()
	if tokenErr != nil {
		if tokenErr == io.EOF {
			break // end of input
		}
		// handle error somehow
		return fmt.Errorf("decoding token: %w", tokenErr)
	}
	switch t := t.(type) {
	case xml.StartElement:
		// match on the element's namespace and local name
		if t.Name.Space == "foo" && t.Name.Local == "bar" {
			var b bar // bar is a struct type describing the element
			if err := d.DecodeElement(&b, &t); err != nil {
				// handle error somehow
				return fmt.Errorf("decoding element %q: %w", t.Name.Local, err)
			}
			// do something with b
		}
	}
}

Answer 2

Score: 1

This isn't so much a limit of Go as a limit of XML. XML elements only make sense according to their schema (which predefines which elements may appear inside other elements).

Answer 3

Score: 1

You should look at SAX parsers, something like https://github.com/kokardy/saxlike

Answer 4

Score: 0

You can also check out the following library, which has been tested with big XML files. It was written to address performance issues in Go's default xml package.

xml stream parser

huangapple
  • Posted on April 14, 2016, 21:56:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/36625345.html