Hash large file using little memory
Question
I need to hash very large files (>10TB). So I decided to hash 128KB per MB.
My idea is to divide the file into 1MB blocks and hash only the first 128KB of each block.
The following code works, but it uses insane amounts of memory and I can't tell why...
func partialMD5Hash(filePath string) string {
    var blockSize int64 = 1024 * 1024
    var sampleSize int64 = 1024 * 128

    file, err := os.Open(filePath)
    if err != nil {
        return "ERROR"
    }
    defer file.Close()

    fileInfo, _ := file.Stat()
    fileSize := fileInfo.Size()

    hash := md5.New()

    var i int64
    for i = 0; i < fileSize/blockSize; i++ {
        sample := make([]byte, sampleSize)
        _, err = file.Read(sample)
        if err != nil {
            return "ERROR"
        }
        hash.Write(sample)

        _, err := file.Seek(blockSize-sampleSize, 1)
        if err != nil {
            return "ERROR"
        }
    }

    return hex.EncodeToString(hash.Sum(nil))
}
Any help will be appreciated!
Answer 1

Score: 1
There are several problems with the approach, and with the program.
If you want to hash a large file, you have to hash all of it. Sampling parts of the file will not detect modifications to the parts you didn't sample.
You are allocating a new buffer for every iteration. Instead, allocate one buffer outside the for-loop, and reuse it.
Also, you seem to be ignoring how many bytes were actually read. So:
block := make([]byte, blockSize) // one buffer, allocated once and reused
for {
    n, err := file.Read(block)
    if n > 0 {
        // Hash only the bytes that were actually read.
        hash.Write(block[:n])
    }
    if err == io.EOF {
        break
    }
    if err != nil {
        return "ERROR"
    }
}
However, the following would be much more concise:
io.Copy(hash, file)
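Putting that together, here is a minimal sketch of a whole-file variant. The name fullMD5Hash and the (string, error) signature are illustrative choices, not part of the original code:

import (
    "crypto/md5"
    "encoding/hex"
    "io"
    "os"
)

func fullMD5Hash(filePath string) (string, error) {
    file, err := os.Open(filePath)
    if err != nil {
        return "", err
    }
    defer file.Close()

    hash := md5.New()
    // io.Copy streams the file through the hash using a small fixed-size
    // internal buffer, so memory use stays flat no matter how large the file is.
    if _, err := io.Copy(hash, file); err != nil {
        return "", err
    }
    return hex.EncodeToString(hash.Sum(nil)), nil
}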
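And if you decide to keep the 128KB-per-MB sampling anyway, despite the caveat above that unsampled bytes go unchecked, the same two fixes carry over to the seek-based loop: allocate the sample buffer once, and hash only the bytes actually read. A sketch under those assumptions (it uses io.ReadFull, and unlike the original it also samples a trailing partial block):

func partialMD5Hash(filePath string) (string, error) {
    const blockSize = 1024 * 1024 // 1MB blocks
    const sampleSize = 128 * 1024 // hash only the first 128KB of each block

    file, err := os.Open(filePath)
    if err != nil {
        return "", err
    }
    defer file.Close()

    hash := md5.New()
    sample := make([]byte, sampleSize) // allocated once, reused for every block
    for {
        // io.ReadFull keeps reading until the buffer is full or the file ends.
        n, err := io.ReadFull(file, sample)
        if n > 0 {
            hash.Write(sample[:n])
        }
        if err == io.EOF || err == io.ErrUnexpectedEOF {
            break
        }
        if err != nil {
            return "", err
        }
        // Skip the rest of the current 1MB block.
        if _, err := file.Seek(blockSize-sampleSize, io.SeekCurrent); err != nil {
            return "", err
        }
    }
    return hex.EncodeToString(hash.Sum(nil)), nil
}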