March 3, 2013, 04:37:12 · go

Why is the md5 hash of the tar-part of a tar.gz via TeeReader wrong?

Question

I was experimenting with archive/tar and compress/gzip to automate the processing of some backups I have.

My problem is this: I have various .tar and .tar.gz files floating around, so I want to extract the MD5 hash of the .tar.gz file and the MD5 hash of the .tar file inside it, ideally in one pass.

The example code I have so far works fine for the hashes of the files inside the .tar.gz and for the .tar.gz itself, but the hash for the .tar is wrong and I can't figure out what the problem is.

I looked at tar/reader.go and saw that there is some skipping in there, yet I thought everything would go through the io.Reader interface, so the TeeReader should still catch all the bytes.

package main
import (
    "archive/tar"
    "compress/gzip"
    "crypto/md5"
    "fmt"
    "io"
    "os"
)
func main() {
    tgz, _ := os.Open("tb.tar.gz")
    gzMd5 := md5.New()
    gz, _ := gzip.NewReader(io.TeeReader(tgz, gzMd5))
    tarMd5 := md5.New()
    tr := tar.NewReader(io.TeeReader(gz, tarMd5))
    for {
        fileMd5 := md5.New()
        hdr, err := tr.Next()
        if err == io.EOF {
            break
        }
        io.Copy(fileMd5, tr)
        fmt.Printf("%x  %s\n", fileMd5.Sum(nil), hdr.Name)
    }
    fmt.Printf("%x  tb.tar\n", tarMd5.Sum(nil))
    fmt.Printf("%x  tb.tar.gz\n", gzMd5.Sum(nil))
}

Now for the following example:

$ echo "a" > a.txt
$ echo "b" > b.txt
$ tar cf tb.tar a.txt b.txt
$ gzip -c tb.tar > tb.tar.gz
$ md5sum a.txt b.txt tb.tar tb.tar.gz
60b725f10c9c85c70d97880dfe8191b3  a.txt
3b5d5c3712955042212316173ccf37be  b.txt
501352dcd8fbd0b8e3e887f7dafd9392  tb.tar
90d6ba204493d8e54d3b3b155bb7f370  tb.tar.gz

On Linux Mint 14 (based on Ubuntu 12.04) with Go 1.0.2 from the Ubuntu repositories, the result of my Go program is:

$ go run tarmd5.go 
60b725f10c9c85c70d97880dfe8191b3  a.txt
3b5d5c3712955042212316173ccf37be  b.txt
a26ddab1c324780ccb5199ef4dc38691  tb.tar
90d6ba204493d8e54d3b3b155bb7f370  tb.tar.gz

So all hashes except the one for tb.tar are as expected.
(Of course, if you repeat this example, your .tar and .tar.gz hashes will differ from these because of different timestamps.)

Any hint on how to get this to work would be greatly appreciated. I would really prefer to do it in one pass, though (with the TeeReaders).

Answer 1

Score: 5

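The issue occurs because tar.Reader doesn't read every byte from your reader: it stops at the end-of-archive marker, so any padding after it (GNU tar pads archives to a multiple of 10240 bytes by default) never passes through the TeeReader. After the loop over the entries finishes, you need to drain the reader to ensure every byte is read and hashed. The way I normally do this is to use io.Copy() to read until EOF.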

package main
import (
	"archive/tar"
	"compress/gzip"
	"crypto/md5"
	"fmt"
	"io"
	"io/ioutil"
	"os"
)
func main() {
	tgz, _ := os.Open("tb.tar.gz")
	gzMd5 := md5.New()
	gz, _ := gzip.NewReader(io.TeeReader(tgz, gzMd5))
	tarMd5 := md5.New()
	tee := io.TeeReader(gz, tarMd5) // need the reader later
	tr := tar.NewReader(tee)
	for {
		fileMd5 := md5.New()
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		io.Copy(fileMd5, tr)
		fmt.Printf("%x  %s\n", fileMd5.Sum(nil), hdr.Name)
	}
	io.Copy(ioutil.Discard, tee) // read unused portions of the tar file
	fmt.Printf("%x  tb.tar\n", tarMd5.Sum(nil))
	fmt.Printf("%x  tb.tar.gz\n", gzMd5.Sum(nil))
}
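
To see the effect in isolation, here is a minimal sketch that counts how many bytes pass through a TeeReader before and after the drain. It is not part of the original code: countingWriter, paddedTar and teeCounts are hypothetical names, the archive is built in memory, and the zero padding is appended by hand to imitate what GNU tar writes by default. It also assumes Go 1.16 or later for io.Discard.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
)

// countingWriter (hypothetical) records how many bytes were written to it.
type countingWriter struct{ n int64 }

func (c *countingWriter) Write(p []byte) (int, error) {
	c.n += int64(len(p))
	return len(p), nil
}

// paddedTar builds a one-file archive in memory and appends zero padding
// up to 10240 bytes, imitating the record padding GNU tar adds by default.
func paddedTar() []byte {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	body := []byte("a\n")
	tw.WriteHeader(&tar.Header{Name: "a.txt", Mode: 0644, Size: int64(len(body))})
	tw.Write(body)
	tw.Close() // writes the end-of-archive marker
	buf.Write(make([]byte, 10240-buf.Len()))
	return buf.Bytes()
}

// teeCounts returns how many bytes had passed through the TeeReader when
// tar.Reader stopped, and how many after the remainder was drained.
func teeCounts(archive []byte) (beforeDrain, afterDrain int64) {
	seen := &countingWriter{}
	tee := io.TeeReader(bytes.NewReader(archive), seen)
	tr := tar.NewReader(tee)
	for {
		if _, err := tr.Next(); err != nil {
			break // io.EOF or a read error; either way stop iterating
		}
		io.Copy(io.Discard, tr)
	}
	beforeDrain = seen.n
	io.Copy(io.Discard, tee) // read what tar.Reader left behind
	return beforeDrain, seen.n
}

func main() {
	archive := paddedTar()
	before, after := teeCounts(archive)
	fmt.Printf("archive: %d bytes, seen at EOF: %d, seen after drain: %d\n",
		len(archive), before, after)
}
```

Only the count taken after the drain covers the whole archive, which is why a hash taken straight after the loop differs from what md5sum reports for the file.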

Another option is to just add io.Copy(tarMd5, gz) before your tarMd5.Sum() call. I think the first way is clearer, even though it means adding or modifying four lines instead of one.
