Most efficient way to read Zlib compressed file in Golang?

Question

I'm reading in, and at the same time parsing (decoding), a file in a custom format which is compressed with zlib. My question is: how can I efficiently uncompress and then parse the uncompressed content without growing the slice? I would like to parse it whilst reading it into a reusable buffer.

This is for a speed-sensitive application and so I'd like to read it in as efficiently as possible. Normally I would just ioutil.ReadAll and then loop again through the data to parse it. This time I'd like to parse it as it's read, without having to grow the buffer into which it is read, for maximum efficiency.

Basically I'm thinking that if I can find a buffer of the perfect size then I can read into this, parse it, write over the buffer again, parse that, and so on. The issue here is that the zlib reader appears to return an arbitrary number of bytes each time Read(b) is called; it does not fill the slice. Because of this I don't know what the perfect buffer size would be. I'm concerned that it might break up some of the data that I wrote into two chunks, making it difficult to parse, because, say, a uint64 could be split across two reads and therefore not appear in the same buffer read. Or perhaps that can never happen, and it's always read out in chunks of the same size as were originally written?
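
The short reads are easy to observe. Here is a minimal, self-contained demonstration; the exact byte counts vary with the input and the Go version, and error handling is elided for brevity:

package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
)

func main() {
	// Compress 100 KB of zeros into an in-memory buffer.
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	zw.Write(make([]byte, 100*1024))
	zw.Close()

	// Read it back: a single Read often returns far fewer bytes
	// than the length of the slice passed in.
	zr, _ := zlib.NewReader(&buf)
	p := make([]byte, 64*1024)
	n, _ := zr.Read(p)
	fmt.Printf("asked for %d bytes, got %d\n", len(p), n)
}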

  1. What is the optimal buffer size, or is there a way to calculate this?
  2. If I have written data into the zlib writer with f.Write(b []byte), is it possible that this same data could be split into two reads when reading back the compressed data (meaning I would have to keep some history during parsing), or will it always come back in the same read?

Answer 1

Score: 0

You can wrap your zlib reader in a bufio reader, then implement a specialized reader on top that will rebuild your chunks of data by reading from the bufio reader until a full chunk is read. Be aware that bufio.Read calls Read at most once on the underlying Reader, so you need to call ReadByte in a loop. bufio will, however, take care of the unpredictable size of data returned by the zlib reader for you.

If you do not want to implement a specialized reader, you can just go with a bufio reader and read as many bytes as needed with ReadByte() to fill a given data type. The optimal buffer size is at least the size of your largest data structure, up to whatever you can shove into memory.
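
As a minimal sketch of that second approach (the 8-byte little-endian record layout and the readUint64 helper are invented for illustration; io.ReadFull on the bufio reader would achieve the same as the manual ReadByte loop):

package main

import (
	"bufio"
	"compress/zlib"
	"encoding/binary"
	"io"
	"os"
)

// readUint64 fills 8 bytes one at a time with ReadByte, so the
// unpredictable chunk sizes returned by the zlib reader never
// matter at this level.
func readUint64(br *bufio.Reader) (uint64, error) {
	var b [8]byte
	for i := range b {
		c, err := br.ReadByte()
		if err != nil {
			return 0, err
		}
		b[i] = c
	}
	return binary.LittleEndian.Uint64(b[:]), nil
}

func main() {
	fi, err := os.Open("data.zlib") // hypothetical input file
	if err != nil {
		panic(err)
	}
	defer fi.Close()

	zr, err := zlib.NewReader(fi)
	if err != nil {
		panic(err)
	}
	defer zr.Close()

	br := bufio.NewReader(zr)
	for {
		v, err := readUint64(br)
		if err == io.EOF { // a partial trailing value also surfaces as EOF here
			break
		}
		if err != nil {
			panic(err)
		}
		_ = v // parse v here
	}
}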

If you read directly from the zlib reader, there is no guarantee that your data won't be split between two reads.

Another, maybe cleaner, solution is to implement a writer for your data, then use io.Copy(your_writer, zlib_reader).
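
A rough sketch of that writer-based pattern might look like this; the 8-byte record format is again an invented example, and a real parser would hold a state machine instead:

package main

import (
	"compress/zlib"
	"encoding/binary"
	"io"
	"os"
)

// chunkParser implements io.Writer. io.Copy pushes decompressed bytes
// into Write, which buffers them until at least one full record
// (assumed here to be an 8-byte little-endian value) is available.
type chunkParser struct {
	pending []byte
}

func (p *chunkParser) Write(b []byte) (int, error) {
	p.pending = append(p.pending, b...)
	for len(p.pending) >= 8 {
		v := binary.LittleEndian.Uint64(p.pending[:8])
		_ = v // parse the record here
		p.pending = p.pending[8:]
	}
	return len(b), nil
}

func main() {
	fi, err := os.Open("data.zlib") // hypothetical input file
	if err != nil {
		panic(err)
	}
	defer fi.Close()

	zr, err := zlib.NewReader(fi)
	if err != nil {
		panic(err)
	}
	defer zr.Close()

	if _, err := io.Copy(&chunkParser{}, zr); err != nil {
		panic(err)
	}
}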

Answer 2

Score: 0

OK, so I figured this out in the end using my own implementation of a reader.

Basically the struct looks like this:

type reader struct {
	at  int           // offset of the first unread byte in buf
	n   int           // number of unread bytes remaining in buf
	f   io.ReadCloser // the zlib reader
	buf []byte        // reusable read buffer
}

This can be attached to the zlib reader:

// Open file for reading
fi, err := os.Open(filename)
if err != nil {
	return nil, err
}
defer fi.Close()
// Attach zlib reader
r := new(reader)
r.buf = make([]byte, 2048)
r.f, err = zlib.NewReader(fi)
if err != nil {
	return nil, err
}
defer r.f.Close()

Then x bytes can be read straight out of the zlib reader using a function like this:

mydata := r.readx(10)

func (r *reader) readx(x int) []byte {
	// Note: x must not exceed len(r.buf), or this loop can never finish.
	for r.n < x {
		// Move the unread bytes to the front of the buffer to make room.
		copy(r.buf, r.buf[r.at:r.at+r.n])
		r.at = 0
		m, err := r.f.Read(r.buf[r.n:])
		r.n += m
		// A Read may return data together with io.EOF, so only fail if
		// we still don't have enough bytes for the request.
		if err != nil && r.n < x {
			panic(err)
		}
	}
	// Copy out so the caller never aliases the reusable buffer.
	tmp := make([]byte, x)
	copy(tmp, r.buf[r.at:r.at+x])
	r.at += x
	r.n -= x
	return tmp
}

Note that I have no need to check for EOF, because my parser should stop itself at the right place.
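
For fixed-size fields, readx pairs naturally with encoding/binary. For example, assuming the writer used little-endian byte order (an assumption about the format, not something stated above):

id := binary.LittleEndian.Uint64(r.readx(8))    // hypothetical 8-byte field
count := binary.LittleEndian.Uint16(r.readx(2)) // hypothetical 2-byte field

One caveat with readx as written: a single request larger than len(r.buf) (2048 bytes here) can never be satisfied, so the buffer must be at least as large as the biggest field that will ever be requested, which matches the buffer-size advice in Answer 1.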
