Hash large file using little memory

Question

I need to hash very large files (>10TB). So I decided to hash 128KB out of every MB: the idea is to divide the file into 1MB blocks and hash only the first 128KB of each block.

The following code works, but it uses an enormous amount of memory and I can't tell why...

func partialMD5Hash(filePath string) string {
	var blockSize int64 = 1024 * 1024
	var sampleSize int64 = 1024 * 128

	file, err := os.Open(filePath)
	if err != nil {
		return "ERROR"
	}
	defer file.Close()
	fileInfo, _ := file.Stat()
	fileSize := fileInfo.Size()

	hash := md5.New()

	var i int64
	for i = 0; i < fileSize / blockSize; i++ {
		sample := make([]byte, sampleSize)
		_, err = file.Read(sample)
		if err != nil {
			return "ERROR"
		}
		hash.Write(sample)

		_, err := file.Seek(blockSize-sampleSize, 1)
		if err != nil {
			return "ERROR"
		}
	}

	return hex.EncodeToString(hash.Sum(nil))
}

Any help will be appreciated!


Answer 1

Score: 1

There are several problems with the approach, and with the program.

If you want to hash a large file, you have to hash all of it. Sampling parts of the file will not detect modifications to the parts you didn't sample.

You are allocating a new buffer on every iteration. Instead, allocate one buffer outside the for loop and reuse it.

Also, you seem to be ignoring how many bytes were actually read. So:

    block := make([]byte, blockSize)
    for {
        n, err := file.Read(block)
        if n > 0 {
            hash.Write(block[:n]) // hash only the bytes actually read
        }
        if err == io.EOF {
            break
        }
        if err != nil {
            return "ERROR"
        }
    }

However, the following is much more concise:

    io.Copy(hash, file)

Posted by huangapple on 2021-09-22 22:17:57
Original link: https://go.coder-hub.com/69286042.html