File reading and checksums in Go: difference between methods
Question
Recently I've been creating checksums for files in Go. My code has to handle both small and big files. I tried two methods: the first uses ioutil.ReadFile("filename") and the second uses os.Open("filename").
Examples:
The first function uses io/ioutil and works for small files. When I try to copy a big file, my RAM fills up: for a 1.5GB ISO it uses 3GB of RAM.
func byteCopy(fileToCopy string) {
	file, err := ioutil.ReadFile(fileToCopy) // 1.5GB file
	omg(err) // error handling function
	ioutil.WriteFile("2.iso", file, 0777)
	os.Remove("2.iso")
}
It gets even worse when I want to create a checksum with crypto/sha512 and io/ioutil. It never finishes and aborts because it runs out of memory.
func ioutilHash() {
	file, _ := ioutil.ReadFile(iso)
	h := sha512.New()
	fmt.Printf("%x", h.Sum(file))
}
When using the function below everything works fine.
func ioHash() {
	f, err := os.Open(iso) // iso is a big ~1.5GB file
	omg(err) // error handling function
	defer f.Close()
	h := sha512.New()
	io.Copy(h, f)
	fmt.Printf("%x", h.Sum(nil))
}
My question:
Why is the ioutil.ReadFile() function not working right? The 1.5GB file should not fill my 16GB of RAM. I don't know where to look right now.
Could somebody explain the differences between the methods? I don't get it from reading the Go docs and examples.
Having usable code is nice, but understanding why it works matters far more.
Thanks in advance!
Answer 1
Score: 3
The following code doesn't do what you think it does.
func ioutilHash() {
	file, _ := ioutil.ReadFile(iso)
	h := sha512.New()
	fmt.Printf("%x", h.Sum(file))
}
This first reads your 1.5GB ISO. As jnml pointed out, it continuously makes bigger and bigger buffers to hold it. In the end, the total buffer size is no less than 1.5GB and no greater than 1.875GB (by the current implementation).
However, after that you make another buffer! h.Sum(file) doesn't hash file. It appends the current hash to file! This may or may not cause yet another allocation.
The real problem is that you are taking that file, now with the hash appended, and printing it with %x. fmt actually pre-computes its output using the same type of method that, as jnml pointed out, ioutil.ReadAll uses. So it constantly allocates bigger and bigger buffers to store the hex of your file. Since each hex character represents only 4 bits, every byte of the file becomes two characters, so we are talking about a buffer of no less than 3GB and no greater than 3.75GB for that.
This means your active buffers may be as big as 5.625GB. Combine that with the GC not being perfect and not removing all the intermediate buffers, and it could very easily fill your memory.
The correct way to write that code would have been:
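func ioutilHash() {
	file, err := ioutil.ReadFile(iso)
	omg(err) // error handling function from the question
	h := sha512.New()
	h.Write(file) // feed the file contents into the hash
	fmt.Printf("%x", h.Sum(nil)) // Sum(nil) returns only the digest
}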
This does nowhere near as many allocations.
The bottom line is that ReadFile is rarely what you want to use. IO streaming (using readers and writers) is always the best way when it is an option. Not only do you allocate much less when you use io.Copy, you also hash and read the disk concurrently. In your ReadFile example, the disk and the CPU are used one after the other even though they don't depend on each other.
Answer 2
Score: 1
ioutil.ReadFile is working right. The fault lies in abusing system resources by using that function on things you know are huge.
ioutil.ReadFile is a handy helper for files you're pretty sure in advance are going to be small, like configuration files, most source code files, etc. (Actually it optimizes for files <= 1e9 bytes, but that's an implementation detail and not part of the API contract. Your 1.5GB file forces it to grow the slice and thus allocate more than one big buffer for your data while reading the file.)
Even your other approach using os.File is not okay. You definitely should be using the "bufio" package for sequential processing of large files; see bufio.NewReader.