Why is the md5 hash of the tar-part of a tar.gz via TeeReader wrong?

Question

I was just experimenting with archive/tar and compress/gzip for automated processing of some backups I have.

My problem is this: I have various .tar and .tar.gz files floating around, so I want to compute the md5 hash of the .tar.gz file and the md5 hash of the .tar file inside it, ideally in one run.

The example code I have so far works perfectly fine for the hashes of the files inside the .tar.gz as well as for the .tar.gz itself, but the hash for the .tar is wrong and I can't figure out why.

I looked at tar/reader.go and saw that there is some skipping in there, yet I thought everything should run through the io.Reader interface, so the TeeReader should still catch all the bytes.

package main

import (
    "archive/tar"
    "compress/gzip"
    "crypto/md5"
    "fmt"
    "io"
    "os"
)

func main() {
    tgz, _ := os.Open("tb.tar.gz")
    gzMd5 := md5.New()
    gz, _ := gzip.NewReader(io.TeeReader(tgz, gzMd5))
    tarMd5 := md5.New()
    tr := tar.NewReader(io.TeeReader(gz, tarMd5))
    for {
        fileMd5 := md5.New()
        hdr, err := tr.Next()
        if err == io.EOF {
            break
        }
        io.Copy(fileMd5, tr)
        fmt.Printf("%x  %s\n", fileMd5.Sum(nil), hdr.Name)
    }
    fmt.Printf("%x  tb.tar\n", tarMd5.Sum(nil))
    fmt.Printf("%x  tb.tar.gz\n", gzMd5.Sum(nil))
}

Now for the following example:

$ echo "a" > a.txt
$ echo "b" > b.txt
$ tar cf tb.tar a.txt b.txt 
$ gzip -c tb.tar > tb.tar.gz
$ md5sum a.txt b.txt tb.tar tb.tar.gz

60b725f10c9c85c70d97880dfe8191b3  a.txt
3b5d5c3712955042212316173ccf37be  b.txt
501352dcd8fbd0b8e3e887f7dafd9392  tb.tar
90d6ba204493d8e54d3b3b155bb7f370  tb.tar.gz

On Linux Mint 14 (based on Ubuntu 12.04) with Go 1.0.2 from the Ubuntu repositories, the result of my Go program is:

$ go run tarmd5.go 
60b725f10c9c85c70d97880dfe8191b3  a.txt
3b5d5c3712955042212316173ccf37be  b.txt
a26ddab1c324780ccb5199ef4dc38691  tb.tar
90d6ba204493d8e54d3b3b155bb7f370  tb.tar.gz

So all hashes except for tb.tar are as expected.
(Of course if you retry that example your .tar and .tar.gz will be different from this, because of different timestamps)

Any hint about how to get it to work would be greatly appreciated. I would really prefer to do it in one run, though (with the TeeReaders).

Answer 1

Score: 5

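The issue occurs because archive/tar doesn't read every byte from your reader: it stops once it reaches the end-of-archive marker (two zero-filled 512-byte blocks), so any zero padding that tar wrote after that marker never passes through the TeeReader. After hashing the files, you need to drain the reader to ensure every byte is read and hashed. I normally do this by using io.Copy() to read until EOF.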

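package main

import (
	"archive/tar"
	"compress/gzip"
	"crypto/md5"
	"fmt"
	"io"
	"io/ioutil"
	"log"
	"os"
)

func main() {
	tgz, err := os.Open("tb.tar.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer tgz.Close()

	// Every compressed byte read by gzip is also written to gzMd5.
	gzMd5 := md5.New()
	gz, err := gzip.NewReader(io.TeeReader(tgz, gzMd5))
	if err != nil {
		log.Fatal(err)
	}
	// Every decompressed byte read through tee is also written to tarMd5.
	tarMd5 := md5.New()
	tee := io.TeeReader(gz, tarMd5) // need the reader later
	tr := tar.NewReader(tee)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err) // hdr is nil here, so don't touch hdr.Name
		}
		fileMd5 := md5.New()
		if _, err := io.Copy(fileMd5, tr); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%x  %s\n", fileMd5.Sum(nil), hdr.Name)
	}
	// read unused portions of the tar file
	if _, err := io.Copy(ioutil.Discard, tee); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%x  tb.tar\n", tarMd5.Sum(nil))
	fmt.Printf("%x  tb.tar.gz\n", gzMd5.Sum(nil))
}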

Another option is to add io.Copy(tarMd5, gz) just before your tarMd5.Sum() call. I think the first way is clearer, even though it means adding or modifying four lines instead of one.

huangapple
  • Posted on March 3, 2013 at 04:37:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/15179194.html