Is there a faster alternative to ioutil.ReadFile?
Question
I am trying to make a program that checks for duplicate files based on an MD5 checksum.
I'm not sure whether I am missing something, but this function uses 16 GB of RAM while reading the Xcode installer app (which is about 8 GB):
func search() {
	unique := make(map[string]string)
	files, err := ioutil.ReadDir(".")
	if err != nil {
		log.Println(err)
	}
	for _, file := range files {
		fileName := file.Name()
		fmt.Println("CHECKING:", fileName)
		fi, err := os.Stat(fileName)
		if err != nil {
			fmt.Println(err)
			continue
		}
		if fi.Mode().IsRegular() {
			data, err := ioutil.ReadFile(fileName)
			if err != nil {
				fmt.Println(err)
				continue
			}
			sum := md5.Sum(data)
			hexDigest := hex.EncodeToString(sum[:])
			if _, ok := unique[hexDigest]; !ok {
				unique[hexDigest] = fileName
			} else {
				fmt.Println("DUPLICATE:", fileName)
			}
		}
	}
}
As per my debugging, the issue is with the file reading.
Is there a better approach?
Thanks.
Answer 1 (Score: 6)
There is an example in the Go documentation that covers your case.
package main
import (
	"crypto/md5"
	"fmt"
	"io"
	"log"
	"os"
)
func main() {
	f, err := os.Open("file.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%x", h.Sum(nil))
}
For your case, just make sure to close the files in the loop and not defer them. Or put the logic into a function.
Answer 2 (Score: 5)
Sounds like the 16GB RAM is your problem, not speed per se.
Don't read the entire file into a variable with ReadFile; io.Copy from the Reader that Open gives you to the Writer that hash/md5 provides (md5.New returns a hash.Hash, which embeds an io.Writer). That only copies a little bit at a time instead of pulling all of the file into RAM.
This is a trick useful in a lot of places in Go; packages like text/template, compress/gzip, net/http, etc. work in terms of Readers and Writers. With them, you don't usually need to create huge []bytes or strings; you can hook I/O interfaces up to each other and let them pass around pieces of content for you. In a garbage collected language, saving memory tends to save you CPU work as well.