Standard xml parser has very low performance in Golang

Question

I have a 100 GB XML file and parse it with a SAX-style approach in Go using this code:

    file, err := os.Open(filename)
    handle(err)
    defer file.Close()
    buffer := bufio.NewReaderSize(file, 1024*1024*256) // 256 MiB read buffer
    decoder := xml.NewDecoder(buffer)
    for {
        t, _ := decoder.Token()
        if t == nil {
            break
        }
        switch se := t.(type) {
        case xml.StartElement:
            if se.Name.Local == "House" {
                house := House{}
                err := decoder.DecodeElement(&house, &se)
                handle(err)
            }
        }
    }

But Go runs very slowly, judging by both execution time and disk usage. My HDD is capable of reading data at around 100-120 MB/s, but Go uses only 10-13 MB/s.

As an experiment, I rewrote the code in C#:

    using (XmlReader reader = XmlReader.Create(filename))
    {
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    if (reader.Name == "House")
                    {
                        //Code
                    }
                    break;
            }
        }
    }

And the HDD is fully loaded: C# reads data at 100-110 MB/s, and the execution time is around 10 times lower.

How can I improve XML parsing performance in Go?

Answer 1

Score: 4


These 5 things can help increase speed using the encoding/xml library:
(Tested against an XML file with 75k entries, 20 MB; the percentages are relative to the previous bullet.)

  1. Use well-defined structures.
  2. Implement xml.Unmarshaler on all your structures.
    • Lots of code.
    • Saves 20% time and 15% allocs.
  3. Replace d.DecodeElement(&foo, &token) with foo.UnmarshalXML(d, token).
    • Almost 100% safe.
    • Saves 10% time & allocs.
  4. Use d.RawToken() instead of d.Token().
    • Needs manual handling of nested objects and namespaces.
    • Saves 10% time & 20% allocs.
  5. If you use d.Skip(), reimplement it using d.RawToken().

I reduced time and allocs by 40% in my specific use case, at the cost of more code, more boilerplate, and potentially worse handling of corner cases (my inputs are fairly consistent), but it's still not enough.

    benchstat first.bench.txt parseraw.bench.txt
    name          old time/op    new time/op    delta
    Unmarshal-16     1.06s ± 6%     0.66s ± 4%  -37.55%  (p=0.008 n=5+5)

    name          old alloc/op   new alloc/op   delta
    Unmarshal-16     461MB ± 0%     280MB ± 0%  -39.20%  (p=0.029 n=4+4)

    name          old allocs/op  new allocs/op  delta
    Unmarshal-16     8.42M ± 0%     5.03M ± 0%  -40.26%  (p=0.016 n=4+5)
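For context, a benchstat comparison like the one above is typically produced along these lines (a sketch: the file names come from the invocation above, and the benchmark function name Unmarshal is inferred from the output):

```shell
# Run the benchmark before the changes, saving the results
go test -bench=Unmarshal -benchmem -count=5 > first.bench.txt
# ...apply the RawToken-based changes, then run again...
go test -bench=Unmarshal -benchmem -count=5 > parseraw.bench.txt
# Compare the two runs (benchstat is at golang.org/x/perf/cmd/benchstat)
benchstat first.bench.txt parseraw.bench.txt
```

Using -count=5 gives benchstat enough samples to report statistical significance (the p= values above).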

In my experiments, the lack of memoization is the reason for the parser's high time and allocation counts, which slow it down significantly, mostly because of Go copying values.

Answer 2

Score: 3

To answer your question "How can I improve XML parsing performance in Go?":

Using the common xml.NewDecoder / decoder.Token approach, I was seeing 50 MB/s locally. By using https://github.com/tamerh/xml-stream-parser I was able to double the parse speed.

To test, I used Posts.xml (68 GB) from the https://archive.org/details/stackexchange archive torrent.

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "time"

        "github.com/tamerh/xml-stream-parser"
    )

    func main() {
        // Using `Posts.xml` (68 GB) from https://archive.org/details/stackexchange (in the torrent)
        f, err := os.Open("Posts.xml")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        br := bufio.NewReaderSize(f, 1024*1024)
        parser := xmlparser.NewXmlParser(br, "row")

        started := time.Now()
        var previous int64 = 0

        for x := range *parser.Stream() {
            elapsed := int64(time.Since(started).Seconds())
            if elapsed > previous {
                kBytesPerSecond := int64(parser.TotalReadSize) / elapsed / 1024
                fmt.Printf("\r%ds elapsed, read %d kB/s (last post.Id %s)", elapsed, kBytesPerSecond, x.Attrs["Id"])
                previous = elapsed
            }
        }
    }

This will output something along the lines of:

    ...s elapsed, read ... kB/s (last post.Id ...)

The only caveat is that this approach does not give you the convenient unmarshaling into structs.

As discussed in https://github.com/golang/go/issues/21823, speed seems to be a general problem with the XML implementation in Go and would require a rewrite/rethink of that part of the standard library.

huangapple
  • Published on 2017-09-10 05:13:57
  • Please retain this link when reposting: https://go.coder-hub.com/46135167.html