Why do Java's and Go's gzip produce different results?

Question

Firstly, my Java version:

String str = "helloworld";
ByteArrayOutputStream localByteArrayOutputStream = new ByteArrayOutputStream(str.length());
GZIPOutputStream localGZIPOutputStream = new GZIPOutputStream(localByteArrayOutputStream);
localGZIPOutputStream.write(str.getBytes("UTF-8"));
localGZIPOutputStream.close();
localByteArrayOutputStream.close();
for (int i = 0; i < localByteArrayOutputStream.toByteArray().length; i++) {
	System.out.println(localByteArrayOutputStream.toByteArray()[i]);
}

The output is:

31
-117
8
0
0
0
0
0
0
0
-53
72
-51
-55
-55
47
-49
47
-54
73
1
0
-83
32
-21
-7
10
0
0
0

Then the Go version:

var gzBf bytes.Buffer
gzSizeBf := bufio.NewWriterSize(&gzBf, len(str))
gz := gzip.NewWriter(gzSizeBf)
gz.Write([]byte(str))
gz.Flush()
gz.Close()
gzSizeBf.Flush()
GB := (&gzBf).Bytes()
for i := 0; i < len(GB); i++ {
	fmt.Println(GB[i])
}

The output is:

31
139
8
0
0
9
110
136
0
255
202
72
205
201
201
47
207
47
202
73
1
0
0
0
255
255
1
0
0
255
255
173
32
235
249
10
0
0
0

Why?

At first I thought the difference came from how the two languages read bytes. But that could never turn a 0 into a 9, and the two byte slices also have different lengths.

Is my code wrong? Is there any way to make my Go program produce the same output as the Java program?

Thanks!

Answer 1

Score: 18

First, the byte type in Java is signed, with a range of -128..127, while in Go byte is an alias of uint8 with a range of 0..255. So if you want to compare the results, you have to add 256 to the negative Java values.

Tip: To display a Java byte value in an unsigned fashion, use byteValue & 0xff, which converts it to an int using the 8 bits of the byte as the lowest 8 bits of the int. Or better: display both results in hex form so you don't have to care about signedness.
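The same conversion can be sketched from the Go side. The snippet below is only an illustration: javaOutput is a hand-copied prefix of the Java output from the question, and toUnsigned is a hypothetical helper name, not part of any library.

```go
package main

import "fmt"

// javaOutput holds the first values printed by the Java program in the
// question. Java's byte is signed, so values above 127 appear as negatives.
var javaOutput = []int8{31, -117, 8, 0, 0, 0, 0, 0, 0, 0, -53, 72}

// toUnsigned reinterprets signed Java byte values as Go bytes (0..255).
// It is a hypothetical helper written for this illustration.
func toUnsigned(signed []int8) []byte {
	out := make([]byte, len(signed))
	for i, v := range signed {
		out[i] = byte(v) // same 8 bits read as unsigned, e.g. -117 becomes 139
	}
	return out
}

func main() {
	u := toUnsigned(javaOutput)
	fmt.Println(u)         // decimal, directly comparable with the Go output
	fmt.Printf("% x\n", u) // hex, where signedness no longer matters
}
```

Once both outputs are printed this way, the only remaining differences are the ones that come from the gzip streams themselves.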

Even if you do the shift, you will still see different results. That might be due to different default compression levels in the two languages. Note that although the default compression level is 6 in both Java and Go, this is not specified: implementations are allowed to choose different values, and the default might change in future releases.

Even if the compression level were the same, you might still encounter differences. gzip is based on LZ77 and Huffman coding, which uses a tree built from frequencies (probabilities) to decide the output codes. If different input characters or bit patterns have the same frequency, the codes assigned to them might vary between implementations. Moreover, several output bit patterns might have the same length, so a different one might be chosen.

If you want the same output, the only way would be (see the notes below!) to use compression level 0 (no compression at all). In Go use the compression level gzip.NoCompression, and in Java use Deflater.NO_COMPRESSION.

Java:

GZIPOutputStream gzip = new GZIPOutputStream(localByteArrayOutputStream) {
    {
        def.setLevel(Deflater.NO_COMPRESSION);
    }
};

Go:

gz, err := gzip.NewWriterLevel(gzSizeBf, gzip.NoCompression)

But I wouldn't worry about the different outputs. Gzip is a standard: even if the outputs are not the same, any gzip decoder can decompress them, whichever encoder was used to compress the data, and the decoded data will be exactly the same.
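As a quick check of that claim, here is a small Go sketch that feeds both byte sequences from the question to Go's gzip reader. The slices javaGzip and goGzip are copied by hand from the question (the Java values converted to unsigned), and gunzip is a hypothetical helper name.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// Output of the Java program in the question, with negative values
// converted to their unsigned form.
var javaGzip = []byte{31, 139, 8, 0, 0, 0, 0, 0, 0, 0,
	203, 72, 205, 201, 201, 47, 207, 47, 202, 73, 1, 0,
	173, 32, 235, 249, 10, 0, 0, 0}

// Output of the Go program in the question.
var goGzip = []byte{31, 139, 8, 0, 0, 9, 110, 136, 0, 255,
	202, 72, 205, 201, 201, 47, 207, 47, 202, 73, 1, 0,
	0, 0, 255, 255, 1, 0, 0, 255, 255,
	173, 32, 235, 249, 10, 0, 0, 0}

// gunzip decompresses a complete gzip stream held in memory.
func gunzip(data []byte) (string, error) {
	r, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return "", err
	}
	defer r.Close()
	plain, err := io.ReadAll(r)
	if err != nil {
		return "", err
	}
	return string(plain), nil
}

func main() {
	for name, data := range map[string][]byte{"java": javaGzip, "go": goGzip} {
		s, err := gunzip(data)
		fmt.Println(name, s, err)
	}
}
```

Both streams should decode to the same string even though they differ in length and in several bytes.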

Here are the simplified, extended versions:

Not that it matters, but your code is unnecessarily complex. You could simplify it like this (these versions also set compression level 0 and convert the negative Java byte values):

Java version:

ByteArrayOutputStream buf = new ByteArrayOutputStream();
GZIPOutputStream gz = new GZIPOutputStream(buf) {
    { def.setLevel(Deflater.NO_COMPRESSION); }
};
gz.write("helloworld".getBytes("UTF-8"));
gz.close();
for (byte b : buf.toByteArray())
    System.out.print((b & 0xff) + " ");

Go version:

var buf bytes.Buffer
gz, _ := gzip.NewWriterLevel(&buf, gzip.NoCompression)
gz.Write([]byte("helloworld"))
gz.Close()
fmt.Println(buf.Bytes())
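One thing the simplified Go version drops is the gz.Flush() call that the question's code makes before Close. As far as I can tell, a flush makes the writer emit an extra empty block as a sync marker, which would account for at least part of the length difference noted in the question. The sketch below compares the two cases; compress is a hypothetical helper, and the exact byte counts are not something to rely on across Go versions.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// compress gzips data with the default level. When flushFirst is true it
// calls Flush before Close, as the Go code in the question does.
func compress(data []byte, flushFirst bool) ([]byte, error) {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	if _, err := gz.Write(data); err != nil {
		return nil, err
	}
	if flushFirst {
		// Flush pushes pending data out and adds a sync marker to the stream.
		if err := gz.Flush(); err != nil {
			return nil, err
		}
	}
	if err := gz.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	plain, err := compress([]byte("helloworld"), false)
	if err != nil {
		panic(err)
	}
	flushed, err := compress([]byte("helloworld"), true)
	if err != nil {
		panic(err)
	}
	fmt.Println("without Flush:", len(plain), "bytes")
	fmt.Println("with Flush:   ", len(flushed), "bytes")
}
```

Both variants are valid gzip streams; Close alone is enough when the whole payload is written in one go.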

NOTES:

The gzip format allows some extra fields (headers) to be included in the output.

In Go these are represented by the gzip.Header type:

type Header struct {
    Comment string    // comment
    Extra   []byte    // "extra data"
    ModTime time.Time // modification time
    Name    string    // file name
    OS      byte      // operating system type
}

And it is accessible via the Writer.Header struct field. Go sets and inserts them, while Java does not (leaves header fields zero). So even if you set compression level to 0 in both languages, the output will not be the same (but the "compressed" data will match in both outputs).

Unfortunately, standard Java does not provide a way to set these fields, and Go does not make filling the Header fields in the output optional, so you will not be able to generate identical output.
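That said, the exported Header fields can be assigned on a gzip.Writer before the first write, so part of the header is under your control from the Go side. The sketch below sets OS to 0 and leaves ModTime at its zero value. It assumes a reasonably recent Go release, where (to my knowledge) a zero ModTime is written as an MTIME of 0; older releases, such as the one that produced the output in the question, behaved differently. gzipWithJavaStyleHeader is a hypothetical helper name.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"time"
)

// gzipWithJavaStyleHeader compresses data while trying to mimic the header
// bytes seen in the Java output from the question (MTIME 0, OS 0).
// Assumption: the Go release in use writes MTIME 0 for a zero ModTime.
func gzipWithJavaStyleHeader(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	gz.OS = 0                // field of the embedded gzip.Header; the writer's default is 255
	gz.ModTime = time.Time{} // zero value, meaning no timestamp
	if _, err := gz.Write(data); err != nil {
		return nil, err
	}
	if err := gz.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	out, err := gzipWithJavaStyleHeader([]byte("helloworld"))
	if err != nil {
		panic(err)
	}
	fmt.Println(out[:10]) // the fixed 10-byte gzip header
}
```

This only aligns the fixed 10-byte header. The deflate data that follows it may still differ between the two implementations.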

One option is to use a third-party gzip library for Java that supports setting these fields. Apache Commons Compress is one example: it contains a GzipCompressorOutputStream class with a constructor that accepts a GzipParameters instance. GzipParameters is the equivalent of the gzip.Header structure. Only with something like this would you be able to generate identical output.

But as mentioned, generating identical output has no real-life value.

Answer 2

Score: 9

From RFC 1952, the GZip file header is structured as:

+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+

Looking at the output you've provided, we have:

                          |    Java |          Go
ID1                       |      31 |          31
ID2                       |     139 |         139
CM (compression method)   |       8 |           8
FLG (flags)               |       0 |           0
MTIME (modification time) | 0 0 0 0 | 0 9 110 136
XFL (extra flags)         |       0 |           0
OS (operating system)     |       0 |         255

So we can see that Go is setting the modification time field of the header, and setting the operating system to 255 (unknown) rather than 0 (FAT file system). In other respects they indicate that the file is compressed in the same way.
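The table can be reproduced with a few lines of Go. This is a minimal sketch that reads only the fixed 10-byte part of the header. fixedHeader and parseFixedHeader are hypothetical names chosen for this illustration, and MTIME is decoded as little-endian, the byte order RFC 1952 uses for multi-byte values.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// fixedHeader mirrors the first 10 bytes of a gzip member (RFC 1952).
type fixedHeader struct {
	ID1, ID2 byte   // magic bytes, 31 and 139
	CM       byte   // compression method, 8 = deflate
	FLG      byte   // flags
	MTIME    uint32 // modification time, seconds since the Unix epoch
	XFL      byte   // extra flags
	OS       byte   // operating system, 255 = unknown
}

// parseFixedHeader decodes the fixed part of a gzip header.
func parseFixedHeader(b []byte) (fixedHeader, error) {
	if len(b) < 10 {
		return fixedHeader{}, fmt.Errorf("need 10 bytes, got %d", len(b))
	}
	return fixedHeader{
		ID1:   b[0],
		ID2:   b[1],
		CM:    b[2],
		FLG:   b[3],
		MTIME: binary.LittleEndian.Uint32(b[4:8]),
		XFL:   b[8],
		OS:    b[9],
	}, nil
}

func main() {
	// Header bytes copied from the question (Java values shown unsigned).
	javaHdr := []byte{31, 139, 8, 0, 0, 0, 0, 0, 0, 0}
	goHdr := []byte{31, 139, 8, 0, 0, 9, 110, 136, 0, 255}
	for _, raw := range [][]byte{javaHdr, goHdr} {
		h, err := parseFixedHeader(raw)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%+v\n", h)
	}
}
```

Comparing the two parsed structs field by field shows that only MTIME and OS differ, which matches the table.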

In general, these sorts of differences are harmless. If you want to determine whether two compressed files are the same, you should really compare the decompressed versions of the files.

Posted by huangapple on March 12, 2015
When reposting, please keep the link to this article: https://go.coder-hub.com/29002769.html