GoLang: Decompress bz2 in one goroutine, consume in another goroutine
Question
I am a new-grad SWE learning Go (and loving it).
I am building a parser for Wikipedia dump files - basically a huge bzip2-compressed XML file (~50GB uncompressed).
I want to do both streaming decompression and parsing, which sounds simple enough. For decompression, I do:
inputFilePath := flag.Arg(0)
inputReader := bzip2.NewReader(inputFile)
And then pass the reader to the XML parser:
decoder := xml.NewDecoder(inputReader)
However, since both decompressing and parsing are expensive operations, I would like to have them run in separate goroutines to make use of additional cores. How would I go about doing this in Go?
The only thing I can think of is wrapping the file in a chan []byte and implementing the io.Reader interface, but I presume there might be a built-in (and cleaner) way of doing it.
Has anyone ever done something like this?
Thanks!
Manuel
Answer 1
Score: 2
You can use io.Pipe, then use io.Copy to push the decompressed data into the pipe, and read it in another goroutine:
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"sync"
)

func main() {
	rawJson := []byte(`{
		"Foo": {
			"Bar": "Baz"
		}
	}`)
	bzip2Reader := bytes.NewReader(rawJson) // this stands in for the bzip2.NewReader

	var wg sync.WaitGroup
	wg.Add(2)

	r, w := io.Pipe()
	go func() {
		// Write everything into the pipe. Decompression happens in this goroutine.
		io.Copy(w, bzip2Reader)
		w.Close()
		wg.Done()
	}()

	decoder := json.NewDecoder(r)
	go func() {
		for {
			t, err := decoder.Token()
			if err != nil {
				break
			}
			fmt.Println(t)
		}
		wg.Done()
	}()

	wg.Wait()
}
Answer 2
Score: 0
An easy solution is to use a readahead package I created some time back: https://github.com/klauspost/readahead
inputReader := bzip2.NewReader(inputFile)
ra := readahead.NewReader(inputReader)
defer ra.Close()
And then pass the reader to the XML parser:
decoder := xml.NewDecoder(ra)
With default settings it will decode up to 4MB ahead of time in 4 buffers.