What is the fastest way to read text file from hard drive into memory using Go?
Question
I just started using Go after years of using Perl, and from initial tests it seems that reading a text file from a hard drive into a hash is not as fast as in Perl.
In Perl I use the "File::Slurp" module, and it reads a file into memory (into a string variable, array, or hash) really fast, limited only by the hard drive's read throughput.
I am not sure of the best way in Go to read, for example, a 500MB CSV file with 10 columns into memory (into a hash), where the key of the hash is the first column and the value is the remaining 9 columns.
What is the fastest way to achieve this? The goal is to read the data and store it in some Go in-memory variable as fast as the hard drive can deliver it.
This is one line from the input file; there are around 20 million similar lines:
1341,2014-11-01 00:01:23.588,12000,AV7WN259SEH1,1133922,SingleOven/HCP/-PRODUCTION/-23C_30S,0xd8d2a106d44bea07,8665456.006,5456-02,3010-30 N- PHOTO,AV7WN259SEH1
The platform is Windows 7 with an Intel i7 processor and 16GB of RAM. I can install Go on Linux as well if there are benefits to doing so.
Edit:
So one use case is: load the whole file into memory, into one variable, as fast as you can. Later I can scan that variable, split it, and so on, all in memory.
Another approach is to store each line as a key-value pair at load time (e.g. after X bytes have been read or after a \n character arrives).
To me these two approaches can yield different performance results. But since I am very new to Golang, it would probably take me days of trying different techniques to work out the best-performing algorithm.
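To make the target structure concrete, this is roughly the mapping I have in mind for each line (just an untested sketch; the variable names are mine, and for now I keep the remaining 9 columns as one string):

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// One line from the input file (same format as the sample above).
	line := "1341,2014-11-01 00:01:23.588,12000,AV7WN259SEH1,1133922,SingleOven/HCP/-PRODUCTION/-23C_30S,0xd8d2a106d44bea07,8665456.006,5456-02,3010-30 N- PHOTO,AV7WN259SEH1"

	// Key = first column, value = the remaining columns kept as a single string.
	records := make(map[string]string)
	parts := strings.SplitN(line, ",", 2)
	records[parts[0]] = parts[1]

	fmt.Println(records["1341"])
}
```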
I would like to learn the possible ways to do the above in Golang, as well as the recommended ones. At this point I am not concerned about memory usage, since the process will be repeated 10,000 times: as soon as one file has been processed it is erased from memory and the next one is started. Files range from 50MB to 500MB. Since there are several thousand files, any performance gain (even a 1 second gain per file) is a significant overall gain.
I do not want to add complexity to the question by describing what will be done with the data later; I just want to learn the fastest way to read a file from the drive and store it in a hash. I will post more detailed benchmarks of my findings as I learn more about the different ways to do this in Golang and as I hear more recommendations. I am hoping someone has already done research on this topic.
Answer 1
Score: 2
ioutil.ReadFile is probably a good place to start for reading a whole file into memory. That being said, it sounds like a poor use of memory resources. The question asserts that File::Slurp is fast, but that is not the general consensus for the particular task you're doing, namely line-by-line processing.
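For reference, a minimal sketch of that whole-file read followed by an in-memory split (the file name is made up, and errors are simply fatal here); note that this is exactly the two-pass shape discussed below:

```go
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"log"
)

func main() {
	// Pass 1: read the entire file into a single []byte
	// (one allocation roughly the size of the file).
	data, err := ioutil.ReadFile("input.csv") // hypothetical file name
	if err != nil {
		log.Fatal(err)
	}

	// Pass 2: split the in-memory buffer into lines and build the map.
	records := make(map[string]string)
	for _, line := range bytes.Split(data, []byte{'\n'}) {
		cols := bytes.SplitN(line, []byte{','}, 2)
		if len(cols) < 2 {
			continue // skip blank or malformed lines
		}
		records[string(cols[0])] = string(cols[1])
	}

	fmt.Println("loaded", len(records), "records")
}
```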
The claim is that Perl is somehow doing things "fast". We can look at the source code of Perl's File::Slurp. It's not doing any magic, as far as I can tell. As Slade mentions in the comments, it's just using sysopen and sysread, both of which eventually bottom out to plain operating system calls. Frankly, once you touch disk I/O, you've lost: your only hope is to touch it as few times as possible.
Given that your file is 500MB, that you have to read all the bytes of the disk file anyway, and that you have to make a line-oriented pass to process each line, I don't quite see why there's a requirement to do this in two passes. Why turn what is fundamentally a one-pass algorithm into a two-pass algorithm?
Since you haven't shown any other code, we can't really say whether what you've done is fast or slow. Without measurement, we can't say anything substantive. Did you try writing the straightforward code with bufio.Scanner first, and then measuring its performance?
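As a rough starting point, that straightforward single-pass version might look something like this (a sketch under the question's stated assumptions: key is the first column, roughly 20 million lines, file name made up):

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
	"time"
)

func main() {
	start := time.Now()

	f, err := os.Open("input.csv") // hypothetical file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Pre-size the map for roughly 20 million lines to reduce rehashing.
	records := make(map[string]string, 20000000)

	// Single pass: read line by line and build the map as we go.
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		cols := strings.SplitN(scanner.Text(), ",", 2)
		if len(cols) < 2 {
			continue // skip blank or malformed lines
		}
		records[cols[0]] = cols[1]
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("loaded %d records in %v\n", len(records), time.Since(start))
}
```

Timing this against the two-pass ReadFile version on one of your real files is exactly the kind of measurement referred to above.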