英文:
Generating a Huffman code that never produces the string "00" upon encoding
问题
我正在尝试使用赫夫曼编码来为一组符号创建最优编码。然而,有一个限制条件,即编码中不能包含字符串"00"。例如,编码'A' = '0'和'B' = '10'不满足该限制,因为字符串'BA'的编码为'100',其中包含"00"子串。
这意味着编码词也不能包含字符串"00"。例如,编码'A' = '1',B = '00',和C = '01'也不满足该限制,因为编码'B'将始终导致编码中出现'00'。
我尝试修改维基百科上找到的赫夫曼编码算法:
- 为每个符号创建一个叶节点并将其添加到优先队列中。
- 在队列中有多于一个节点时:
- 从队列中删除两个优先级最高(概率最低)的节点。
- 如果两个节点都不是叶节点,则选择优先级最高的节点和优先级次高的叶节点。这确保了所选节点中至少有一个是叶节点。
- 使用这两个节点作为子节点创建一个新的内部节点,其概率等于这两个节点的概率之和。
- 如果一个节点不是叶节点,则将该节点作为新内部节点的右子节点(将其编码为'1')。这可以避免创建"00"子串。
- 将新节点添加到队列中。
- 从队列中删除两个优先级最高(概率最低)的节点。
- 剩下的节点是根节点,树已完成。
- 在所有编码的开头添加'1',以避免在两个相邻符号编码时出现"00"子串。
还有一种情况,即队列中仅剩下两个非叶节点的情况。我不确定如何解决这个问题。除此之外,我认为这可以创建一个满足约束条件的编码,但我不确定它是否是最优的,我希望能够确定。
英文:
I am trying to use Huffman coding to create an optimal coding for a set of symbols. However, a constraint is placed on the encodings such that no encoding contains the string, "00".
For example, the encoding 'A' = '0' and 'B' = '10' would not satisfy the constraint because the string 'BA' encodes to '100', which contains the "00" substring.
This means that the code words also cannot contain the string "00". For example, the encoding 'A' = '1', B = '00', and C = '01' would not satisfy the constraint because encoding 'B' would always result in '00' appearing in the encoding.
I have tried modifying the Huffman coding algorithm found on Wikipedia:
- Create a leaf node for each symbol and add it to the priority queue.
- While there is more than one node in the queue:
- Remove the two nodes of highest priority (lowest probability) from the queue
- If both nodes are not leaf nodes, select the highest priority node and the highest priority leaf node. This ensures that at least one of the selected nodes is a leaf node.
- Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
- If one node is not a leaf node make that node the right child of the new internal node (making it a '1' when encoding). This avoids creating the "00" substring.
- Add the new node to the queue.
- Remove the two nodes of highest priority (lowest probability) from the queue
- The remaining node is the root node and the tree is complete.
- Add a '1' to the beginning of all codes to avoid the "00" substring when two adjacent symbols are encoded.
There is also the case where the only two nodes left in the queue are both non-leaf nodes. I am not sure how to solve this problem. Otherwise, I believe this creates a coding that satisfies the constraint, but I am unsure if it is optimal, and I would like to be certain that it is.
答案1
得分: 5
我认为我应该从一个规则开始,即代码中的任何“0”都必须跟随一个“1”。 这满足了不允许包含“00”的代码的约束。 这也避免了在编码两个相邻符号时产生“00”子字符串的问题。
生成的代码树如下所示,其中:
- 红色阴影区域中的节点是包含“00”的代码
- 包含红色X的节点是以“0”结尾的代码
- 绿色节点是可用的有效代码
请注意,因为赫夫曼编码是前缀自由编码,选择其中一个有效代码会消除该节点的所有后代。 例如,选择使用代码“01”会消除树的左侧的所有其他节点。 换句话说,选择“01”使“01”成为一个叶子,并断开了“01”下面的两个连接。
还要注意,内部节点的左子节点的代码比右子节点长,因此具有较低概率的子节点必须连接在左侧。 这当然是必要的。 它被留作一个练习来证明它是足够的。 (如果不足够,那么练习就是提出最佳分配算法。)
英文:
I think I'd start with the rule that any "0" in a code must be followed by a "1". That satisfies the constraint that codes are not allowed to contain "00". It also avoids the problem of a "00" substring being produced when two adjacent symbols are encoded.
The resulting code tree is shown below, where
- the nodes in the red shaded areas are codes that contain "00"
- the nodes containing a red X are codes that end with a "0"
- the green nodes are the available valid codes
Note that because a Huffman code is a prefix-free code, choosing one of the valid codes eliminates all of the descendants of that node. For example, choosing to use the code "01" eliminates all of the other nodes on the left side of the tree. To put it another way, choosing "01" makes "01" a leaf, and breaks the two connections below "01".
Also note that the left child of an interior node will have a longer code than the right child, so the child with lower probability must be connected on the left. That's certainly necessary. It's left as an exercise to prove that it's sufficient. (If not sufficient, then the exercise is to come up with the optimal assignment algorithm.)
答案2
得分: 2
以下是翻译好的部分:
最简单的方法是根本不要涉及哈夫曼编码。相反,对输出进行后处理。
我们将从简单的比特填充开始。在编码时,取您编码的比特流,每当出现 0
时,在其后插入 1
。在解码端,每当看到 0
时,删除下一个比特(这将是插入的 1
)。然后进行正常的哈夫曼解码。
这并不是最优的,但离最优的距离是有界的。您可以通过根据需要在每个节点处交换分支以将较低的概率或权重放在 0 的一侧来减小比特填充的影响。
这种简单的比特填充平均扩展了输入 1.5 倍。
那么,这种简单的比特填充距离最佳有多近呢?事实上,具有不连续出现两个 0
比特的 n 位模式数量是 Fn+2,其中 Fn 是第 n 个斐波那契数。使用这样的序列,我们最多可以编码 log2(Fn+2) 比特。然后,n 比特的最佳扩展比是 n / log2(Fn+2)。在较大的 n 的极限情况下,这是 1 / log2(φ),其中 φ 是黄金比例。这个最佳扩展比是 1.44042。
因此,来自简单比特填充的 1.5 实际上还算不错。它距离最优值仅有 4% 的差距。
但我们可以做得更好。
我们可以使用斐波那契序列,它表示在不重复 0
的情况下每添加一个比特到序列中的可能编码值的数量,来编码输入比特。我们在下面展示了这种方法,尽管为了方便,我们避免了任何重复的 1
,而不是任何重复的 0
。以下是用于固定大小输入和输出的示例 C 代码:
(接下来是 C 代码,请注意,它包含一些特定的编码和解码功能)
输入然后每次编码 87 位,输出为每个输入块的 125 或 126 位,如果 125 位的顶部位置有 1
位,则需要填充一个 0
。
选择 87 和 125 是因为它们是最有效的集合,可以使输出适应 128 位。这给出了一个扩展比 1.44117,接近最优值的 0.05%。还有许多其他选择。输出一次编码一位,并一次解码一位,因此不需要在整数中累积它。然后,我们可以使用 112 位编码为 161 位。或者我们可以限制自己使用 64 位算术并将 62 位转换为 89 位(接近最优值的 0.09%)。
在比特序列的末尾,剩余的比特可以用高零位扩展到 87 位,然后编码的比特将具有不需要发送的高零位。在解码时,用高 0
位填充 125 的最后位。如果不知道解码时要期望多少比特,然后在编码之前在输入中附加一个单独的高 1
位,然后跟随高 0
位。然后在解码时,从末尾开始扫描 0
位以找到第一个 1
位,然后丢弃所有这些位。
这种满足比特流约束的后处理方法可以说优于任何尝试修改哈夫曼编码以使其对不同权重的字母最优的尝试。它不仅允许快速且经过充分测试的哈夫曼算法实现原样使用,而且这将适用于_任何_熵编码方案,无论是哈夫曼、算术还是不对称数字系统。
英文:
The easiest way is to not mess with the Huffman code at all. Instead, post-process the output.
We will start with simple bit stuffing. When encoding, take your coded bit stream and whenever there is a 0
, insert a 1
after it. On the decoding end, whenever you see a 0
, remove the next bit (which will be the 1
that was inserted). Then do the normal Huffman decoding.
This is not optimal, but the departure from optimality is bounded. You can reduce the impact of the bit stuffing by swapping the branches at every node, as needed, to put the lower probabilities or weights on the 0 sides.
This simple bit stuffing expands the input, on average, by a factor of 1.5.
So how close is this simple bit stuffing to optimal? It turns out the the number of possible n-bit patterns that have no occurrences of two 0
bits in a row is F<sub>n+2</sub>, where F<sub>n</sub> is the n<sup>th</sup> Fibonacci number. With such a sequence we can code at most log<sub>2</sub>(F<sub>n+2</sub>) bits. The optimal expansion ratio of n bits is then n / log<sub>2</sub>(F<sub>n+2</sub>). In the limit of large n, that is 1 / log<sub>2</sub>(𝜑), where 𝜑 is the Golden Ratio. That optimal expansion ratio is 1.44042.
So the 1.5 from simple bit stuffing actually isn't too shabby. It's within 4% of optimal.
But we can do better.
We can use the Fibonacci sequence, which represents the number of possible coded values for each bit added to the sequence without repeating 0
s, to code the input bits. We show such an approach below, though for convenience, we avoid any repeating 1
s instead of any repeating 0
s. (Just invert the output to avoid repeating 0
s.) Here is example C code that does that encoding and decoding for a fixed-size input and output:
typedef unsigned __int128 u128_t;
// Encode 87 bits to a 125-bit sequence that does not have two 1 bits in a row
// anywhere. Note that if such sequences are concatenated and the high bit of
// the 125 is a 1, then a 0 bit needs to be appended to make it 126 bits. This
// will occur 38.2% of the time (1 / goldenratio^2). The final expansion ratio
// of this encoding is then 125.382 / 87 = 1.44117. The theoretical optimum
// ratio is 1 / lg(goldenratio) = 1.44042. This encoding gets within 0.05% of
// optimal.
u128_t encode87to125(u128_t n) {
n &= ((u128_t)1 << 87) - 1;
u128_t e = 0;
// Fibonacci numbers 126 and 125. (gcc and clang do not support 128-bit
// literals, so these are assembled, which will occur at compile time.)
u128_t fn = ((u128_t)0x4f88f2 << 64) | 0x877183a413128aa8u,
fnm1 = ((u128_t)0x3127c1 << 64) | 0xed0f4dff88ba1575u;
for (;;) {
// Encode one bit.
e <<= 1;
if (n >= fn) {
e |= 1;
n -= fn;
}
if (fn == 1)
// Done when the start of the Fibonacci sequence (1, 1) reached.
break;
// Compute the Fibonacci number that precedes fnm1, and move fn and
// fnm1 both down one in the sequence.
u128_t fnm2 = fn - fnm1;
fn = fnm1;
fnm1 = fnm2;
}
return e;
}
// Decode a 125-bit value encoded by encode87to125() back to the orignal 87-bit
// value.
u128_t decode125to87(u128_t e) {
// Start at the beginning of the Fibonacci sequence (1, 1).
u128_t d = 0, fn = 1, fnm1 = 1;
for (;;) {
// Decode the low bit of e.
if (e & 1)
d += fn;
e >>= 1;
if (e == 0)
// Done when there are no more 1 bits in e, since nothing more will
// be added to d.
break;
// Advance fnm1 and fn up one spot in the Fibonacci sequence.
u128_t fnp1 = fn + fnm1;
fnm1 = fn;
fn = fnp1;
}
return d;
}
The input is then encoded 87 bits at a time, and the output is 125 or 126 bits for each input block, the latter if the 125 bits has a 1
bit in the top position, in which case a 0
needs to be stuffed.
The values 87 and 125 were picked since they are the most efficient set that permits the output to fit in 128 bits. This gives an expansion ratio of 1.44117, within 0.05% of optimal. Many other choices are possible. The output is encoded a bit at a time, and decoded a bit at a time, so there is no need to accumulate it in an integer. Then we could go to 112 bits encoded in 161 bits. Or we could limit ourselves to 64-bit arithmetic and convert 62 bits to 89 bits (within 0.09% of optimal).
At the end of the bit sequence, the remaining bits can be extended to 87 bits with high zeros, and the encoded bits will then have high zeros that don't need to be sent. When decoding, fill out the last bits of the 125 with high 0
s. If you don't know how many bits to expect when decoding, then append a single high 1
bit to the input before encoding, followed by high 0
bits. Then when decoding, scan back from the end through the 0
bits to get to the first 1
bit, and then discard all of those.
This post-processing approach to meet the constraints on the bit stream is arguably preferable to any attempt at modifying the Huffman code to be optimal for different-weight letters. Not only does it permit fast and well-tested Huffman algorithm implementations to be used as is, but also this will work with any entropy coding scheme, be it Huffman, arithmetic, or asymmetric numeral systems.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论