What are some simple patterns that compression algorithms look for?

Question
What kinds of patterns in the data do compression algorithms look for?
For example, do they look for repetition: "ABBBBC" = "AB4C"?
Or do they look for more complicated patterns? For example, I know text can be compressed using trees, but it may not work for all data types because some data types may have equal probabilities for each character or symbol. To be clear, I'm wondering about universal compression algorithms like .zip
Answer 1
Score: 0
zip's default compression method, deflate, looks for redundancy in two ways. First, it looks for repeated strings of bytes, from three to 258 bytes in length, as far back as 32768 bytes. Such repeated strings are encoded as a length (3..258) and a distance (1..32768). A length greater than the distance results in more than one copy; e.g. a length of 100 and a distance of 1 repeats the last byte 100 times. So `hello, hello, wheeeeeeeee!` becomes `hello, ` (7,7) `whe` (8,1) `!`.

In deflate, the literal bytes (e.g. `h`), the lengths (e.g. 7), and the end-of-stream symbol are combined into a single symbol type, a literal/length. If it is a length, it is followed by a distance symbol.
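The copy rule above, including the case where the length exceeds the distance, can be sketched as a small decoder. This is a toy illustration with a made-up token format (strings for literal runs, tuples for matches), not deflate's actual bit-level encoding:

```python
def lz77_decode(tokens):
    """Decode a mix of literal strings and (length, distance)
    back-references, mimicking deflate's LZ77 stage. The token
    format is hypothetical: a str is a run of literals, a tuple
    is a (length, distance) match."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            length, distance = tok
            # Copy byte by byte so that length > distance works:
            # each copied byte is immediately available as source
            # for the next copy, giving run-length behaviour.
            for _ in range(length):
                out.append(out[-distance])
        else:
            out.extend(tok.encode())
    return out.decode()

# The example from the answer:
print(lz77_decode(["hello, ", (7, 7), "whe", (8, 1), "!"]))
# → hello, hello, wheeeeeeeee!

# The question's "ABBBBC" example, expressed as a (3, 1) match:
print(lz77_decode(["AB", (3, 1), "C"]))  # → ABBBBC
```

Copying one byte at a time is the key detail: it lets a match overlap its own output, which is how deflate expresses runs of a repeated byte.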
The second thing that deflate looks for is statistical. The literal/length symbols are counted, and the more frequent ones are coded in fewer bits, while the less frequent ones are coded in more bits. The same is done for the distance codes. This is done optimally using Huffman coding. For the distribution of letters in English, an `e` might be coded in three bits, and a `q` in ten bits.
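A minimal sketch of how Huffman's algorithm assigns those code lengths: repeatedly merge the two lowest-weight nodes, and every symbol under a merged node sinks one level deeper. This is a toy version for illustration only; deflate's real construction additionally caps code lengths at 15 bits.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Return the Huffman code length in bits for each symbol of
    `text`; frequent symbols end up with shorter codes."""
    freq = Counter(text)
    # Heap entries: (weight, tiebreaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreaker = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Every symbol under the merged node is one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreaker, merged))
        tiebreaker += 1
    return heap[0][2]

lengths = huffman_code_lengths("this is an example of huffman coding")
# Frequent symbols (like the space) get short codes; rare ones
# (like 'x') get long codes.
print(sorted(lengths.items(), key=lambda kv: kv[1]))
```

The resulting lengths always satisfy the Kraft equality (the codeword lengths exactly fill a binary tree), which is what makes the code decodable without separators.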
deflate has a few more tricks:
- Longer lengths and distances are coded as symbols that each cover a range of such values, followed by enough extra bits to pick which one in that range.
- The input is broken into small blocks (on the order of 10's of K each), so that the Huffman codes can adapt to changing statistics in the data.
- The Huffman codes need to be defined at the start of each block, and those descriptions are themselves Huffman coded.
- There are "static" blocks that use a pre-defined set of Huffman codes, avoiding the overhead of transmitting code descriptions for small amounts of data.
- There are "stored" blocks for data that deflate could not compress, to minimize the expansion of incompressible data.
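These behaviours can be observed with Python's standard zlib module, which implements the same deflate format: repetitive data shrinks dramatically, while random data (which deflate cannot compress) grows only by a few bytes of framing overhead thanks to the stored blocks.

```python
import os
import zlib

# Repetitive data: the string matching and Huffman stages shrink
# it to a small fraction of its original size.
text = b"hello, hello, wheeeeeeeee! " * 100
print(len(text), "->", len(zlib.compress(text, 9)))

# Random data: deflate finds no redundancy and falls back to
# "stored" blocks, so the output is only slightly larger than
# the input rather than expanding significantly.
noise = os.urandom(10000)
print(len(noise), "->", len(zlib.compress(noise, 9)))
```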
All compression methods can be considered to consist of a modeling step followed by an entropy coding step. The modeling step uses information known about expected redundancies in the data to model it in a different form that extracts and compactly expresses the redundancies actually found. For lossy compression, that step also excises information deemed unimportant to the reconstruction of the data. The entropy coding step takes the result of the modeling step and codes it into a sequence of bits, ideally representing the statistics accurately enough that each bit is almost equally likely to be a 0 or a 1.
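The "ideal" bit count that the entropy coding step aims for is the Shannon entropy of the modeled symbols. A simple sketch, treating each byte as an independent symbol:

```python
import math
from collections import Counter

def entropy_bits_per_symbol(data):
    """Shannon entropy of the byte distribution: the ideal average
    number of bits per symbol an entropy coder could achieve if
    each byte were coded independently."""
    freq = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in freq.values())

# English-like text needs well under 8 bits per byte:
print(entropy_bits_per_symbol(b"the quick brown fox jumps over the lazy dog"))
# A uniform distribution over all 256 byte values needs exactly 8:
print(entropy_bits_per_symbol(bytes(range(256))))  # → 8.0
```

A good model lowers this entropy (e.g. by replacing repeated strings with short match symbols) before the entropy coder spends its bits.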