Is there a faster alternative to ioutil.ReadFile?
Question
I am trying to make a program that checks for duplicate files based on their MD5 checksum.
I'm not sure whether I'm missing something, but this function uses 16GB of RAM while reading the Xcode installer app (which is about 8GB):
func search() {
	unique := make(map[string]string)
	files, err := ioutil.ReadDir(".")
	if err != nil {
		log.Println(err)
	}
	for _, file := range files {
		fileName := file.Name()
		fmt.Println("CHECKING:", fileName)
		fi, err := os.Stat(fileName)
		if err != nil {
			fmt.Println(err)
			continue
		}
		if fi.Mode().IsRegular() {
			data, err := ioutil.ReadFile(fileName)
			if err != nil {
				fmt.Println(err)
				continue
			}
			sum := md5.Sum(data)
			hexDigest := hex.EncodeToString(sum[:])
			if _, ok := unique[hexDigest]; !ok {
				unique[hexDigest] = fileName
			} else {
				fmt.Println("DUPLICATE:", fileName)
			}
		}
	}
}
As per my debugging, the issue is with the file reading.
Is there a better approach?
Thanks.
Answer 1
Score: 6
There is an example in the Go documentation which covers your case:
package main

import (
	"crypto/md5"
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	f, err := os.Open("file.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("%x", h.Sum(nil))
}
For your case, just make sure to close each file inside the loop rather than deferring the close (deferred calls only run when the whole function returns, so deferring in a loop keeps every file open). Or put the per-file logic into its own function, so each defer fires as that function returns.
Answer 2
Score: 5
Sounds like the 16GB RAM is your problem, not speed per se.
Don't read the entire file into a variable with ReadFile; io.Copy from the Reader that Open gives you to the Writer that hash/md5 provides (md5.New returns a hash.Hash, which embeds an io.Writer). That only copies a little bit at a time instead of pulling all of the file into RAM.
This is a trick useful in a lot of places in Go; packages like text/template, compress/gzip, net/http, etc. work in terms of Readers and Writers. With them, you don't usually need to create huge []bytes or strings; you can hook I/O interfaces up to each other and let them pass around pieces of content for you. In a garbage-collected language, saving memory tends to save you CPU work as well.