Concurrently parsing records in a binary file in Go
Question
I have a binary file that I want to parse. The file is broken up into records that are 1024 bytes each. The high-level steps needed are:
- Read 1024 bytes at a time from the file.
- Parse each 1024-byte "record" (chunk) and place the parsed data into a map or struct.
- Return the parsed data and any error(s) to the user.
I'm not looking for code, just design/approach help.
Due to I/O constraints, I don't think it makes sense to attempt concurrent reads from the file. However, I see no reason why the 1024-byte records can't be parsed using goroutines so that multiple 1024-byte records are being parsed concurrently. I'm new to Go, so I wanted to see if this makes sense or if there is a better (faster) way:
- A main function opens the file and reads 1024 bytes at a time into byte arrays (records).
- The records are passed to a function that parses the data into a map or struct. The parser function would be called as a goroutine on each record.
- The parsed maps/structs are appended to a slice via a channel. I would preallocate the slice's underlying array with the file size (in bytes) divided by 1024, as this should be the exact number of elements (assuming no errors).
I'd have to make sure I don't run out of memory as well, as the file can be anywhere from a few hundred MB up to 256 TB (rare, but possible). Does this make sense, or am I thinking about this problem incorrectly? Will this be slower than simply parsing the file linearly as I read it 1024 bytes at a time, or will parsing these records concurrently as byte arrays perform better?
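For concreteness, a minimal sketch of the flow I'm describing might look like this (`Record` and `parseRecord` are placeholders for the real format, and error handling is mostly omitted); it spawns one goroutine per record, which is part of why I'm worried about memory:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"sync"
)

// Record is a placeholder for whatever a parsed 1024-byte chunk becomes.
type Record struct{}

// parseRecord is a placeholder for the real parsing logic.
func parseRecord(buf []byte) (Record, error) {
	return Record{}, nil
}

func parseFile(path string) ([]Record, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		return nil, err
	}
	// Preallocate one slot per 1024-byte record, based on the file size.
	records := make([]Record, info.Size()/1024)

	var wg sync.WaitGroup
	for i := range records {
		buf := make([]byte, 1024) // each goroutine gets its own buffer
		if _, err := io.ReadFull(f, buf); err != nil {
			return nil, err
		}
		wg.Add(1)
		go func(i int, buf []byte) {
			defer wg.Done()
			rec, _ := parseRecord(buf) // error handling omitted in this sketch
			records[i] = rec           // each goroutine writes a distinct index, so no lock is needed
		}(i, buf)
	}
	wg.Wait()
	return records, nil
}

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: parse <file>")
	}
	records, err := parseFile(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("parsed", len(records), "records")
}
```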
Answer 1
Score: 2
This is an instance of the producer-consumer problem, where the producer is your main function that generates 1024-byte records and the consumers should process these records and send them to a channel so they are added to the final slice. There are a few questions tagged producer-consumer and Go; they should get you started. As for what is fastest in your case, it depends on so many things that it is really not possible to answer. The best solution may be anywhere from a completely sequential implementation to a cluster of servers in which the records are moved around by RabbitMQ or something similar.
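As a rough sketch of that producer-consumer shape with a fixed-size worker pool (so memory stays bounded regardless of file size), something along these lines could work; `Record`, `parseRecord`, and the channel buffer sizes are placeholders, and error reporting is only hinted at:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"runtime"
	"sync"
)

// Record and parseRecord stand in for the real record format and parsing logic.
type Record struct{}

func parseRecord(buf []byte) (Record, error) { return Record{}, nil }

func parseFile(path string) ([]Record, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	jobs := make(chan []byte, 64)    // producer -> workers; the buffer bounds memory use
	results := make(chan Record, 64) // workers -> collector

	// Consumers: a fixed-size pool of parsing goroutines.
	var wg sync.WaitGroup
	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for buf := range jobs {
				rec, err := parseRecord(buf)
				if err != nil {
					continue // a real implementation would report this back to the caller
				}
				results <- rec
			}
		}()
	}

	// Close results once every worker is done, so the collector loop can end.
	go func() {
		wg.Wait()
		close(results)
	}()

	// Producer: sequential 1024-byte reads from the single file handle.
	go func() {
		defer close(jobs)
		for {
			buf := make([]byte, 1024)
			if _, err := io.ReadFull(f, buf); err != nil {
				return // io.EOF ends the stream; other errors are dropped in this sketch
			}
			jobs <- buf
		}
	}()

	// Collector: gather parsed records as they arrive (not in file order).
	var out []Record
	for rec := range results {
		out = append(out, rec)
	}
	return out, nil
}

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: parse <file>")
	}
	records, err := parseFile(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("parsed", len(records), "records")
}
```

Records arrive at the collector in whatever order the workers finish; if the original file order matters, each job would also need to carry its record index so the collector can place it correctly.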
Comments