英文:
Go: Mechanics of binary encoding
问题
我正在尝试创建一个新的二进制编码包,因为标准的Go编码/二进制包并不完全符合我的需求。
我不明白的是为什么在binary.PutUvarint
中,编码/二进制包使用的是x >>= 7
而不是x >>= 8
。如果我理解正确,这是有意将位数向右移动7位而不是8位,这导致存储一个uint64所需的总位数为80位,而不是明显需要的64位。为什么会这样?这一定与生成可变长度的字节切片有关,但是为什么>>7
会有帮助呢?
以下是用于参考的二进制编码函数:
func PutUvarint(buf []byte, x uint64) int {
i := 0
for x >= 0x80 {
buf[i] = byte(x) | 0x80
x >>= 7
i++
}
buf[i] = byte(x)
return i + 1
}
英文:
I'm trying to make a new binary encoding package because the standard Go encoding/binary package does not do exactly what I want.
What I do not understand is why encoding/binary is using x >>= 7
in binary.PutUvarint
instead of x >>= 8
. If I understand correctly this is deliberately shifting the bits by 7 instead of by 8, which results in an overall size of 80 bits to store a uint64 instead of the 64 bits that is clearly the number of bits required. Why? What is the reason for this? This must be something to do with the fact that it is generating variable length slices of bytes, but why would >>7 help this?
Here is the binary encoding function for your reference:
func PutUvarint(buf []byte, x uint64) int {
i := 0
for x >= 0x80 {
buf[i] = byte(x) | 0x80
x >>= 7
i++
}
buf[i] = byte(x)
return i + 1
}
答案1
得分: 8
[encoding/binary/varint.go]
包binary
该文件实现了对64位整数的"varint"编码。
编码规则如下:
- 无符号整数以每次7位的方式进行序列化,从最低有效位开始。
- 每个输出字节中的最高有效位(msb)指示是否有连续的字节(msb = 1)。
- 有符号整数使用"zig-zag"编码映射到无符号整数:正值x写为2x + 0,负值写为2(^x) + 1;即,负数取反,是否取反的信息编码在第0位。
设计说明:
最多需要10个字节来表示64位值。编码可以更紧凑:一个完整的64位值需要额外的一个字节来存储第63位。相反,可以使用前一个字节的msb来存储第63位,因为我们知道最多只有64位。这是一个微小的改进,可以将最大编码长度减少到9个字节。然而,这会破坏msb始终是"连续位"的不变性,因此使得该格式与更大数字(比如128位)的varint编码不兼容。
无损数据压缩的经典技术之一是Huffman编码,其中常见的符号通常用比不常见的符号更少的位数表示。实际上,较小的数字出现的频率最高。因此,如果我们可以高效地表示小的数字,即使较大的数字表示效率较低,我们也可以实现无损数据压缩。
Uvarint是一种使用一个或多个字节对无符号整数进行序列化的方法。较小的数字占用较少的字节。uvarint中的每个字节(除了最后一个字节)都设置了最高有效位(msb),这表示还有更多的字节。每个字节的低7位用于以7位为一组存储数字,最低有效组先存储。我们可以对任意位数的无符号整数进行这样的编码。
例如,
uint bits : uvarint bytes
7 7f : 1 7f
14 3fff : 2 ff7f
21 1fffff : 3 ffff7f
28 fffffff : 4 ffffff7f
35 7ffffffff : 5 ffffffff7f
42 3ffffffffff : 6 ffffffffff7f
49 1ffffffffffff : 7 ffffffffffff7f
56 ffffffffffffff : 8 ffffffffffffff7f
63 7fffffffffffffff : 9 ffffffffffffffff7f
64 ffffffffffffffff : 10 ffffffffffffffffff01
以此类推,我们可以根据需要对任意位数的uint进行编码。
如果我们有大量的数字,这些数字被表示为1到49位的无符号64位整数的uvarints,序列化为1到7字节的uvarints,那么我们不会在意将一些数字表示为57到64位的无符号64位整数的uvarints,序列化为9到10字节的uvarints。
参考资料:
英文:
> encoding/binary/varint.go
>
> package binary
>
> // This file implements "varint" encoding of 64-bit integers.
> // The encoding is:
> // - unsigned integers are serialized 7 bits at a time, starting with the
> // least significant bits
> // - the most significant bit (msb) in each output byte indicates if there
> // is a continuation byte (msb = 1)
> // - signed integers are mapped to unsigned integers using "zig-zag"
> // encoding: Positive values x are written as 2x + 0, negative values
> // are written as 2(^x) + 1; that is, negative numbers are complemented
> // and whether to complement is encoded in bit 0.
> //
> // Design note:
> // At most 10 bytes are needed for 64-bit values. The encoding could
> // be more dense: a full 64-bit value needs an extra byte just to hold bit 63.
> // Instead, the msb of the previous byte could be used to hold bit 63 since we
> // know there can't be more than 64 bits. This is a trivial improvement and
> // would reduce the maximum encoding length to 9 bytes. However, it breaks the
> // invariant that the msb is always the "continuation bit" and thus makes the
> // format incompatible with a varint encoding for larger numbers (say 128-bit).
A classic technique for lossless data compression is Huffman coding where the more common symbols are generally represented using fewer bits than less common symbols. In practice, smaller numbers occur most often. Therefore, if we can represent small numbers efficiently, even if larger numbers are represented less efficiently, we will get lossless data compression.
Uvarints are a method of serializing unsigned integers using one or more bytes. Smaller numbers take a smaller number of bytes. Each byte in a uvarint, except the last byte, has the most significant bit (msb) set – this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the number in groups of 7 bits, least significant group first. We can do this for unsigned integers with any number of bits.
For example,
uint bits : uvarint bytes
7 7f : 1 7f
14 3fff : 2 ff7f
21 1fffff : 3 ffff7f
28 fffffff : 4 ffffff7f
35 7ffffffff : 5 ffffffff7f
42 3ffffffffff : 6 ffffffffff7f
49 1ffffffffffff : 7 ffffffffffff7f
56 ffffffffffffff : 8 ffffffffffffff7f
63 7fffffffffffffff : 9 ffffffffffffffff7f
64 ffffffffffffffff : 10 ffffffffffffffffff01
And so on, for as many uint bits as we want.
If we have a lots of numbers that are represented as 1 to 49 bits of an unsigned 64-bit integer serialized to uvarints of 1 to 7 bytes, we won't care if we have a few numbers that are represented as 57 to 64 bits of an unsigned 64-bit integer serialized to uvarints of 9 to 10 bytes.
References:
答案2
得分: 3
他们这样做的原因是可以将一个64位整数(值为27)编码为一个字节。
这是通过将最高有效位(msb)用来指示流中是否还有另一个字节来实现的。
因此,他们每个字节只能编码7位信息(第8位是继续位)。
这基于一个假设,即大多数整数/长整数实际上是小数。
顺便提一下:Sun在JDK中也做出了同样的假设,当调用Integer.valueOf(123);
时,他们缓存了-128至+127之间的Integer/Long实例,以节省封装数字时的整体堆空间。
英文:
The reason they do this, is they can encode a 64 bit int (value 27) in a single byte.
This is achieved by stealing the msb to indicate if there's another byte in the stream.
So, they can only encode 7 bits of information per byte (8th bit being the continuation)
It's based on the assumption that most integers/longs are actually small numbers.
Side note: this same assumption was made by Sun in the JDK in that they cache -128 - +127 Integer / Long instances to save overall heap space on boxed numbers when folks call Integer.valueOf(123);
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论