Re-using the same encoder/decoder for the same struct type in Go without creating a new one

Question
I was looking for the quickest/most efficient way to store structs of data to persist on the filesystem. I came across the encoding/gob package, which allows encoders and decoders to be set up for structs, converting them to []byte (binary) that can be stored.
This was relatively easy - here's a decoding example:
// Per-item get request
// binary = []byte for the encoded binary from the database
// target = struct receiving what's being decoded
func Get(path string, target *SomeType) {
    binary := someFunctionToGetBinaryFromSomeDB(path)
    dec := gob.NewDecoder(bytes.NewReader(binary))
    dec.Decode(target)
}
However, when I benchmarked this against the JSON encoder/decoder, I found it to be almost twice as slow. This was especially noticeable when I created a loop to retrieve all structs. Upon further research, I learned that creating a new decoder every time is really expensive; around 5000 decoders end up being created.
// Imagine 5000 items in total
func GetAll(target *[]SomeType) {
    results := getAllBinaryStructsFromSomeDB()
    for results.next() {
        binary := results.getBinary()
        // Making a new decoder 5000 times
        dec := gob.NewDecoder(bytes.NewReader(binary))
        var item SomeType
        dec.Decode(&item)
        // ... append item to the target slice
    }
}
I'm stuck here trying to figure out how I can recycle (reduce reuse recycle!) a decoder for list retrieval. Understanding that the decoder takes an io.Reader, I was thinking it would be possible to 'reset' the io.Reader and use the same reader at the same address for a new struct retrieval, while still using the same decoder. I'm not sure how to go about doing that and I'm wondering if anyone has any ideas to shed some light. What I'm looking for is something like this:
// Imagine 5000 items in total
func GetAll(target *[]SomeType) {
    // Set up some kind of recyclable reader
    var binary []byte
    reader := bytes.NewReader(binary)
    // Make decoder based on that reader
    dec := gob.NewDecoder(reader)
    results := getAllBinaryStructsFromSomeDB()
    for results.next() {
        // Insert some kind of binary / decoder reset
        // Then do something like:
        reader.Reset(results.nextBinary())
        var item SomeType
        dec.Decode(&item) // except of course this won't work
        // ... append item to the target slice
    }
}
Thanks!
Answer 1

Score: 2
The encoder and decoder are designed to work with streams of values. The encoder writes information describing a Go type to the stream once before transmitting the first value of the type. The decoder retains received type information for decoding subsequent values.
The type information written by the encoder is dependent on the order that the encoder encounters unique types, the order of fields in structs and more. To make sense of the stream, a decoder must read the complete stream written by a single encoder.
It is not possible to recycle decoders because of the way that type information is transmitted.
To make this more concrete, the following does not work:
var v1, v2 Type
var buf bytes.Buffer
gob.NewEncoder(&buf).Encode(v1)
gob.NewEncoder(&buf).Encode(v2)
var v3, v4 Type
d := gob.NewDecoder(&buf)
d.Decode(&v3)
d.Decode(&v4)
Each call to Encode writes information about Type to the buffer. The second call to Decode fails because a duplicate type is received.
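The working counterpart is one encoder/decoder pair for the whole stream: the type information is written once, up front, and every subsequent value reuses it. A minimal sketch of that pattern (the Item type and the encodeAll/decodeAll helpers are illustrative names, not part of the question's code):

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"io"
	"log"
)

// Item stands in for whatever struct type is being persisted.
type Item struct {
	ID   int
	Name string
}

// encodeAll writes every item through a single Encoder, so the
// type information is transmitted only once, at the start of the stream.
func encodeAll(items []Item) []byte {
	var buf bytes.Buffer
	enc := gob.NewEncoder(&buf)
	for _, it := range items {
		if err := enc.Encode(it); err != nil {
			log.Fatal(err)
		}
	}
	return buf.Bytes()
}

// decodeAll reads the complete stream back with a single Decoder.
func decodeAll(data []byte) []Item {
	dec := gob.NewDecoder(bytes.NewReader(data))
	var out []Item
	for {
		var it Item
		err := dec.Decode(&it)
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		out = append(out, it)
	}
	return out
}

func main() {
	in := []Item{{1, "a"}, {2, "b"}, {3, "c"}}
	out := decodeAll(encodeAll(in))
	fmt.Println(len(out), out[0].Name, out[2].Name) // 3 a c
}
```

Note this only helps if the records are stored as one gob stream; if each record must remain an independently encoded blob in the database, a fresh decoder per blob is unavoidable for the reason described above.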
Answer 2

Score: 2
> I was looking for the quickest/efficient way to store Structs of data to persist on the filesystem
Instead of serializing your structs, represent your data primarily in a pre-made data store that fits your usage well. Then model that data in your Go code.
This may seem like the hard way or the long way to store data, but it will solve your performance problem by intelligently indexing your data and allowing filtering to be done without a lot of filesystem access.
> I was looking for ... data to persist.
Let's start there as a problem statement.
> gob module allows encoders and decoders to be set up for structs to convert to []byte (binary) that can be stored.
> However, ... I found it to be ... slow.
It would be. You'd have to go out of your way to make data storage any slower. Every object you instantiate from your storage will have to come from a filesystem read. The operating system will cache these small files well, but you'll still be reading the data every time.
Every change will require rewriting all the data, or cleverly determining which data to write to disk. Recall that there is no "insert between" operation for files; you'll be rewriting all bytes after to add bytes in the middle of a file.
You could do this concurrently, of course, and goroutines handle a bunch of async work like filesystem reads very well. But now you've got to start thinking about locking.
My point is, for the cost of trying to serialize your structures you can better describe your data at the persistence layer, and solve problems you're not even working on yet.
SQL is a pretty obvious choice, since you can make it work with sqlite as well as other SQL servers that scale well; I hear mongodb is easy to wrangle these days, and depending on what you're doing with the data, redis has a number of attractive list, set and k/v operations that can easily be made atomic and consistent.