Hash large file using little memory
Question
I need to hash very large files (>10TB). So I decided to hash 128KB per MB.
My idea is to divide the file into 1MB blocks and hash only the first 128KB of each block.
The following code works, but it uses insane amounts of memory and I can't tell why...
func partialMD5Hash(filePath string) string {
    var blockSize int64 = 1024 * 1024
    var sampleSize int64 = 1024 * 128

    file, err := os.Open(filePath)
    if err != nil {
        return "ERROR"
    }
    defer file.Close()

    fileInfo, _ := file.Stat()
    fileSize := fileInfo.Size()

    hash := md5.New()

    var i int64
    for i = 0; i < fileSize/blockSize; i++ {
        sample := make([]byte, sampleSize)
        _, err = file.Read(sample)
        if err != nil {
            return "ERROR"
        }
        hash.Write(sample)

        _, err := file.Seek(blockSize-sampleSize, 1)
        if err != nil {
            return "ERROR"
        }
    }

    return hex.EncodeToString(hash.Sum(nil))
}
Any help will be appreciated!
Answer 1

Score: 1
There are several problems with the approach, and with the program.
If you want to hash a large file, you have to hash all of it. Sampling parts of the file will not detect modifications to the parts you didn't sample.
You are allocating a new buffer for every iteration. Instead, allocate one buffer outside the for-loop, and reuse it.
Also, you seem to be ignoring how many bytes were actually read. So:
block := make([]byte, blockSize) // one buffer, allocated once and reused
for {
    n, err := file.Read(block)
    if n > 0 {
        // Hash only the bytes that were actually read.
        hash.Write(block[:n])
    }
    if err == io.EOF {
        break
    }
    if err != nil {
        return "ERROR"
    }
}
However, the following would be much more concise:
io.Copy(hash, file)
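Putting that together, here is a minimal sketch of a whole-file variant. The name fullMD5Hash and the (string, error) signature are illustrative choices, not part of the original code:

import (
    "crypto/md5"
    "encoding/hex"
    "io"
    "os"
)

func fullMD5Hash(filePath string) (string, error) {
    file, err := os.Open(filePath)
    if err != nil {
        return "", err
    }
    defer file.Close()

    hash := md5.New()
    // io.Copy streams the file through the hash using a small fixed-size
    // internal buffer, so memory use stays flat no matter how large the file is.
    if _, err := io.Copy(hash, file); err != nil {
        return "", err
    }
    return hex.EncodeToString(hash.Sum(nil)), nil
}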
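And if you decide to keep the 128KB-per-MB sampling anyway, despite the caveat above that unsampled bytes go unchecked, the same two fixes carry over to the seek-based loop: allocate the sample buffer once, and hash only the bytes actually read. A sketch under those assumptions (it uses io.ReadFull, and unlike the original it also samples a trailing partial block):

func partialMD5Hash(filePath string) (string, error) {
    const blockSize = 1024 * 1024 // 1MB blocks
    const sampleSize = 128 * 1024 // hash only the first 128KB of each block

    file, err := os.Open(filePath)
    if err != nil {
        return "", err
    }
    defer file.Close()

    hash := md5.New()
    sample := make([]byte, sampleSize) // allocated once, reused for every block
    for {
        // io.ReadFull keeps reading until the buffer is full or the file ends.
        n, err := io.ReadFull(file, sample)
        if n > 0 {
            hash.Write(sample[:n])
        }
        if err == io.EOF || err == io.ErrUnexpectedEOF {
            break
        }
        if err != nil {
            return "", err
        }
        // Skip the rest of the current 1MB block.
        if _, err := file.Seek(blockSize-sampleSize, io.SeekCurrent); err != nil {
            return "", err
        }
    }
    return hex.EncodeToString(hash.Sum(nil)), nil
}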