Standard XML parser has very low performance in Golang
Question
I have a 100 GB XML file and parse it SAX-style in Go with this code:

    file, err := os.Open(filename)
    handle(err)
    defer file.Close()
    buffer := bufio.NewReaderSize(file, 1024*1024*256) // 268435456 bytes (256 MiB)
    decoder := xml.NewDecoder(buffer)
    for {
        // Token returns a nil token at end of input (or on error).
        t, _ := decoder.Token()
        if t == nil {
            break
        }
        switch se := t.(type) {
        case xml.StartElement:
            if se.Name.Local == "House" {
                house := House{}
                err := decoder.DecodeElement(&house, &se)
                handle(err)
            }
        }
    }

But Go is very slow here, judging by both execution time and disk throughput. My HDD can read at around 100-120 MB/s, but the Go program reads only 10-13 MB/s.
As an experiment, I rewrote this code in C#:

    using (XmlReader reader = XmlReader.Create(filename))
    {
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    if (reader.Name == "House")
                    {
                        //Code
                    }
                    break;
            }
        }
    }

This fully loads the HDD: C# reads data at 100-110 MB/s, and the execution time is around 10 times lower.
How can I improve XML parsing performance in Go?
Answer 1
Score: 4
These 5 things can help increase speed when using the `encoding/xml` library:
(Tested against an XML file with 75k entries, 20 MB; each percentage is relative to the previous bullet.)
- Use well-defined structures.
- Implement `xml.Unmarshaler` on all your structures.
  - Lots of code.
  - Saves 20% time and 15% allocs.
- Replace `d.DecodeElement(&foo, &token)` with `foo.UnmarshalXML(d, token)`.
  - Almost 100% safe.
  - Saves 10% time and allocs.
- Use `d.RawToken()` instead of `d.Token()`.
  - Needs manual handling of nested objects and namespaces.
  - Saves 10% time and 20% allocs.
- If you use `d.Skip()`, reimplement it using `d.RawToken()`.
I reduced time and allocs by 40% for my specific use case, at the cost of more code, more boilerplate, and potentially worse handling of corner cases. My inputs are fairly consistent, so that trade-off is acceptable, but the speedup is still not enough.

    benchstat first.bench.txt parseraw.bench.txt
    name          old time/op    new time/op    delta
    Unmarshal-16     1.06s ± 6%     0.66s ± 4%  -37.55%  (p=0.008 n=5+5)

    name          old alloc/op   new alloc/op   delta
    Unmarshal-16     461MB ± 0%     280MB ± 0%  -39.20%  (p=0.029 n=4+4)

    name          old allocs/op  new allocs/op  delta
    Unmarshal-16     8.42M ± 0%     5.03M ± 0%  -40.26%  (p=0.016 n=4+5)

In my experiments, the lack of memoization in the XML parser is the reason for its large time and allocation costs, and it slows parsing down significantly, mostly because Go copies by value.
Answer 2
Score: 3
To answer your question, "How can I improve XML parse performance using golang?"
Using the common `xml.NewDecoder` / `decoder.Token` approach, I was seeing 50 MB/s locally. By using https://github.com/tamerh/xml-stream-parser I was able to double the parse speed.
To test, I used `Posts.xml` (68 GB) from the https://archive.org/details/stackexchange archive torrent.

    package main

    import (
        "bufio"
        "fmt"
        "github.com/tamerh/xml-stream-parser"
        "os"
        "time"
    )

    func main() {
        // Using `Posts.xml` (68 GB) from https://archive.org/details/stackexchange (in the torrent)
        f, err := os.Open("Posts.xml")
        if err != nil {
            panic(err)
        }
        defer f.Close()
        br := bufio.NewReaderSize(f, 1024*1024)
        parser := xmlparser.NewXmlParser(br, "row")
        started := time.Now()
        var previous int64 = 0
        for x := range *parser.Stream() {
            elapsed := int64(time.Since(started).Seconds())
            if elapsed > previous {
                kBytesPerSecond := int64(parser.TotalReadSize) / elapsed / 1024
                fmt.Printf("\r%ds elapsed, read %d kB/s (last post.Id %s)", elapsed, kBytesPerSecond, x.Attrs["Id"])
                previous = elapsed
            }
        }
    }

This will output something along the lines of:
...s elapsed, read ... kB/s (last post.Id ...)
The only caveat is that this does not give you the convenience of unmarshaling into a struct.
As discussed in https://github.com/golang/go/issues/21823, speed seems to be a general problem with the XML implementation in Go, and fixing it would require a rewrite or rethink of that part of the standard library.