Retrieving gobs written to file by appending several times

Question

I am trying to use encoding/gob to store data in a file and load it later. I want to be able to append new data to the file and load all saved data later, e.g. after restarting my application. Storing to the file with Encode() works without problems, but when reading I only ever get the items that were stored first, not the ones stored subsequently.

Here is a minimal example: https://play.golang.org/p/patGkKDLhM

As you can see, writing twice to an encoder and then reading the data back works. But after closing the file and reopening it in append mode, writing seems to work, yet reading works only for the first two elements (the ones written previously). The two newly added structs cannot be retrieved; I get the error:

> panic: extra data in buffer

I am aware of https://stackoverflow.com/questions/30927993/append-to-golang-gob-in-a-file-on-disk and I also read https://groups.google.com/forum/#!topic/golang-nuts/bn6vjC5Abd8

Finally, I also found https://gist.github.com/kjk/8015952 which seems to demonstrate that what I am trying to do does not work. Why? What does this error mean?

Answer 1

Score: 3

I have not used the encoding/gob package yet (it looks cool, I might have to find a project for it). But from reading the godoc, it seems to me that each encoding is a single record that is expected to be decoded from beginning to end. That is, once you Encode a stream, the resulting bytes are a complete set covering the entire stream from start to finish, and it cannot be appended to later by encoding again with a new Encoder.

The godoc states that a gob stream is self-describing. The stream describes each struct type, including its field names, before the first value of that type. What follows in the byte stream is the size and byte representation of the values of those exported fields.

One could then assume that what the docs leave unsaid is this: since the stream describes itself at the very beginning, including each field that is about to be passed, that description is all the Decoder cares about. The Decoder will not know what to do with any bytes appended after what has been described, as it only sees what was described at the beginning. So the error message panic: extra data in buffer is accurate.

In your Playground example, you encode twice with the same encoder instance and then close the file. Since you pass in exactly two records and encode two records, that may work because the single encoder instance sees the two Encode calls as one encoded stream. Then, when you close the file's stream, the gob is complete, and the stream is treated as a single record (even though you sent in two types).

The same goes for the decoding function: you read X number of times from the same stream. But when closing the file you have written a single record, one that actually has two types in it. That is why it works when reading 2, and EXACTLY 2, but fails when reading more than 2.

A solution, if you want to store everything in a single file, is to create your own index of each complete "write", or encoder instance/session. Write some form of Block method of your own that wraps each entry written to disk with a "begin" and "end" marker. That way, when reading the file back, you know exactly how large a buffer to allocate because of the begin/end markers. Once you have a single record in a buffer, you use gob's Decoder to decode it. And close the file after each write.

The pattern I use for such markers is something like:

uint64:uint64
uint64:uint64
...

The first is the starting byte offset, and the second entry, separated by a colon, is its length. I usually store this in another file though, appropriately called indexes. That way it can be quickly read into memory, and then I can stream the large file knowing exactly where each start and end address is in the byte stream.

Another option is just to store each gob in its own file, using the file system directory structure to organize them as you see fit (one could even use the directories to define types, for example). Then the existence of each file is a single record. This is how I store my rendered JSON from Event Sourcing techniques, with millions of files organized in directories.

In summary, it seems to me that a gob of data is a complete set of data from beginning to end: a single "record", if you will. If you want to store multiple encodings/multiple gobs, then you will need to create your own index to track the start and size/end of each gob's bytes as you store them. Then you will want to Decode each entry separately.

Posted by huangapple on April 3, 2016 at 21:26:41
Original link: https://go.coder-hub.com/36385955.html