Optimal algorithm for encoding data within pre-existing data
Question
Say we have some existing random data, uniformly distributed, that's been written to some medium. Let's also say that the writing of 1 bits is a destructive action, while for 0 bits it is non-destructive. This could be analogous to punching holes in a punch card or burning a fuse in a circuit to hard-code a 1. After this initial write has already taken place, let's say we now wish to write additional data to this medium, with some special encoding that will enable us to retrieve this new data in the future, with no special knowledge of the pre-existing data that was already destructively written to the medium. The pre-existing data itself does not need to remain retrievable - only the new data.
Also assume that the new data is itself random, or has at least already been compressed to be effectively random.
Obviously we cannot expect to exceed ~50% of the capacity of the original storage, since only roughly half will be writable. The trouble is trying to push the efficiency as close to this limit as possible.
I have two fairly simple encodings, with seemingly reasonable efficiencies; however, they do not appear to be optimal. For ease of explanation, I will assume the medium is a paper tape with holes (denoting a 1) or gaps (denoting a 0) at regular intervals.
Encoding A
Encode 1 bits with a gap at an even offset along the tape, and 0 bits with a gap at an odd offset.
```
Offset:   0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19
Tape:     G  G  H  H  H  H  H  G  H  H  H  G  H  H  H  H  G  H  H  H
          |  |                 |           |              |
New Data: 1  0                 0           0              1
```
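To make the rule concrete, here is a small Python sketch (my own illustration, not code from the question; the greedy punching strategy in encode_a is an assumption, and all names are hypothetical). The decoder reads every surviving gap and maps its offset parity to a bit; the encoder punches away gaps of the wrong parity until it finds one it can leave for the current bit.

```python
import random

def decode_a(tape):
    # Every surviving gap is one bit: even offset -> 1, odd offset -> 0.
    return [1 if i % 2 == 0 else 0 for i, s in enumerate(tape) if s == 'G']

def encode_a(tape, bits):
    # One possible greedy encoder: for each bit, punch gaps of the wrong
    # parity until a gap of the right parity appears, then leave that gap
    # as the data gap.  Returns the number of bits that fit.
    pos = written = 0
    for b in bits:
        want = 0 if b == 1 else 1          # offset parity the decoder expects
        while pos < len(tape):
            if tape[pos] == 'G' and pos % 2 == want:
                pos += 1                   # keep this gap: it carries the bit
                written += 1
                break
            if tape[pos] == 'G':
                tape[pos] = 'H'            # wrong parity: destroy it
            pos += 1
        else:
            return written                 # ran out of tape
    return written

# The example tape from the question decodes to 1 0 0 0 1.
assert decode_a(list("GGHHHHHGHHHGHHHHGHHH")) == [1, 0, 0, 0, 1]

# Round trip on a random tape: the first k decoded bits match the payload.
random.seed(0)
tape = [random.choice("GH") for _ in range(200)]
payload = [random.getrandbits(1) for _ in range(200)]
k = encode_a(tape, payload)
assert decode_a(tape)[:k] == payload[:k]
```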
Encoding B
Encode 1 bits with a gap followed by another gap, and 0 bits with a gap followed by a hole.
```
Offset:   0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19
Tape:     H  G  G  H  H  H  H  G  H  H  H  G  G  H  H  H  G  H  G  H
              \/                \/          \/             \/    \/
New Data:     1                 0           1              0     0
```
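A matching decoder for Encoding B is just as short; this sketch (again my own, with hypothetical names) is checked against the example tape above: skip holes until a gap is found, then the symbol after that gap decides the bit.

```python
def decode_b(tape):
    # Scan for the next gap; it and the position after it form one pair:
    # G G -> 1, G H -> 0.  Holes between pairs carry no data.
    bits, pos = [], 0
    while pos + 1 < len(tape):
        if tape[pos] == 'G':
            bits.append(1 if tape[pos + 1] == 'G' else 0)
            pos += 2
        else:
            pos += 1
    return bits

# The example tape from the question decodes to 1 0 1 0 0.
assert decode_b(list("HGGHHHHGHHHGGHHHGHGH")) == [1, 0, 1, 0, 0]
```

Note that an encoder for this scheme can always force a 0 (punch the pair's second position if necessary), but it can only produce a 1 where two adjacent gaps already exist, and it must punch any lone gap it skips over so the decoder does not pair it by mistake.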
Both of these, on average, can encode 25% of the original storage capacity's worth of data. This seems pretty reasonable, at least to me - only half of the tape's positions were gaps to begin with, and any holes we write lose information (since we cannot know whether they pre-existed when we later attempt to decode), so a quarter - 25%. Seems at first like it could be optimal.
However, what's bothering me is that it seems like this is not in fact the case. I have a somewhat more complex encoding that consistently breaks this threshold.
Encoding C
At its core, we encode 1 bits as runs of holes of odd length, and 0 bits as runs of holes of even length. So HHG would represent 0, while HG or HHHG would represent 1.
This by itself is not enough to cross the 25% threshold, however, so we add one optimization. When we see two successive gaps after a run of holes and its terminating gap, we treat those gaps as a 0. The logic behind this is that if we had wanted to encode a 1, we could have just punched out the first of the two gaps, producing HG, which encodes a 1. Since we did not do so, we can assume that we needed to encode a 0 instead.
With this optimization, encoding C reaches a stable storage efficiency of 26.666 ± 0.005 percent, consistently above the theoretical 25%. I also have a handful of really tricky peephole optimizations that I can tack onto this encoding, which I suspect would push its efficiency up to 28-30%, but it feels like I'm already overthinking this.
Encoding D?
This isn't really an encoding, but it is one last thought I had, and it can improve any of the above encodings. Consider an encoding E(x), and some deterministic + reversible transformation T(x) which mutates the data in some arbitrary fashion. Say we prepend a 0 bit to our data-to-be-encoded and encode it: E('0' + DATA). Say we also mutate our data, prepend a 1 bit, and then encode it: E('1' + T(DATA)). We could then compare the two, see which happened to encode more data (greater efficiency), and choose that one. The improvement would be small overall, hinging on statistical variation, and we did sacrifice a bit as an indicator, but as long as the data is large enough, the average savings will outweigh the single bit lost. This could be generalized - partitioning the data into a few partitions and choosing between 2 or more encodings, whichever happened to randomly fit best - but that's beside the point. Overall, the improvement would be small, but it should still be a straight improvement - indicating that the base encoding E(x) is not optimal.
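As a sketch of this selection trick (entirely my own illustration: the greedy Encoding A encoder, the transform, and every name here are placeholders rather than anything from the question), we encode the payload twice, once unchanged behind a 0 flag and once transformed behind a 1 flag, and keep whichever attempt packs more bits onto a copy of the tape; the decoder reads the flag first and, if it is 1, applies the inverse transform.

```python
import random

def greedy_encode(tape, bits):
    # Greedy Encoding A encoder (even-offset gap = 1, odd-offset gap = 0),
    # repeated here so this snippet runs on its own.
    pos = written = 0
    for b in bits:
        want = 0 if b == 1 else 1
        while pos < len(tape):
            if tape[pos] == 'G' and pos % 2 == want:
                pos += 1
                written += 1
                break
            if tape[pos] == 'G':
                tape[pos] = 'H'
            pos += 1
        else:
            return written
    return written

def transform(bits):
    # T(x): any deterministic, reversible mutation will do; flipping every
    # bit is its own inverse, which keeps the demonstration short.
    return [b ^ 1 for b in bits]

def encode_with_choice(tape, data):
    # Try E([0] + data) and E([1] + T(data)) on copies of the tape and keep
    # whichever happens to fit more bits.
    best = None
    for flag, payload in ((0, data), (1, transform(data))):
        trial = list(tape)                  # encoding punches holes, so copy
        written = greedy_encode(trial, [flag] + payload)
        if best is None or written > best[0]:
            best = (written, flag, trial)
    return best

random.seed(2)
tape = [random.choice("GH") for _ in range(400)]
data = [random.getrandbits(1) for _ in range(400)]
written, flag, punched = encode_with_choice(tape, data)

# Decode: read the surviving gaps, strip the flag, undo T if the flag is 1.
decoded = [1 if i % 2 == 0 else 0 for i, s in enumerate(punched) if s == 'G'][:written]
flag_bit, payload_bits = decoded[0], decoded[1:]
if flag_bit == 1:
    payload_bits = transform(payload_bits)
assert flag_bit == flag
assert payload_bits == data[:written - 1]
print(f"chose the flag-{flag} variant, fitting {written} bits (flag included)")
```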
To recap - I'm looking for the most efficient (in the average case) lossless encoding for writing data to a medium that's already been semi-destructively (1's only) written to with pure random data. I really believe the optimal solution is somewhere between 30-50% efficiency, but I'm struggling. I hope someone can either share what the optimal encoding is, or shed some light on the relevant literature / theory around this topic.
Side note: In the process of trying to get a better understanding of this problem, I tried to create a less efficient algorithm that bounds the worst-case encoding efficiency anywhere above 0%, but failed. It seems like no matter the encoding I tried, even if half of the storage were guaranteed to be writable, in an astronomically unlikely event the ordering of the pre-existing data can ensure that we're unable to encode even a single bit of the new data. This isn't really a concern for the actual problem statement, since I'm concerned with the average-case efficiency, but it was unsettling.
Answer 1
Score: 2
I suspect that the expected capacity approaches 50% in the limit as the
number of bits n → ∞.
The encoding algorithm that I have in mind uses linear algebra over the
finite field F₂. Ahead of time, choose ε > 0 and a random
matrix A of dimension (1 − ε) n/2 × n. To decode a vector x, if x is not
all ones, then return the matrix-vector product A (1 − x); otherwise,
fail. To encode a vector b, use Gaussian elimination to solve for
nonzero x′ in A′ (1 − x′) = b, where A′ omits the columns corresponding
to preexisting one bits and x′ omits the rows corresponding to
preexisting one bits. If there is no solution, punch a lace
card.
I don’t have time to write and verify a formal proof, but my intuition
is that, in taking sub-matrices of A, the probability that we encounter
linear dependence decreases very fast as ε edges away from zero.
I implemented a simulation of a more practical version of this algorithm
in C++ below (which could be extended to an encoder/decoder pair without
too much trouble). Instead of being all or nothing, it determines the
longest prefix that it can control and uses the block as a 6-bit length
followed by that amount of data, realizing ≈38% of the original storage.
The length prefix is eating about 9% here (we had control of ≈47% of the
bits), but it approaches 0% in the large-n limit. Even with 64-bit
blocks, you could do better by Huffman coding the lengths.
(You might wonder what happens if we can’t even control all of the
length bits. We can set things up so that the lace card decodes to
length 0, which implies that it is skipped, then punch a lace card
whenever that has to happen.)
```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <limits>
#include <optional>
#include <random>
#include <vector>

namespace {

static std::random_device g_dev;

// A 64-bit vector over the finite field F2.
class Vector {
 public:
  explicit Vector() = default;

  static Vector Random() {
    return Vector{std::uniform_int_distribution<Bits>{
        0, std::numeric_limits<Bits>::max()}(g_dev)};
  }

  Vector &operator^=(const Vector v) {
    bits_ ^= v.bits_;
    return *this;
  }

  bool operator[](const std::size_t i) const { return bits_ & (Bits{1} << i); }

 private:
  using Bits = unsigned long long;

  explicit Vector(const Bits bits) : bits_{bits} {}

  Bits bits_ = 0;
};

class Matrix {
 public:
  static Matrix Random(const std::size_t num_rows) {
    Matrix a;
    a.rows_.reserve(num_rows);
    for (std::size_t i = 0; i < num_rows; ++i) {
      a.rows_.push_back(Vector::Random());
    }
    return a;
  }

  // Models the preexisting data: each row corresponds to a tape position,
  // and a row survives (remains a writable gap) independently with
  // probability p.
  Matrix RandomSubsetOfRows(const double p) const {
    Matrix a;
    for (const Vector row : rows_) {
      if (std::bernoulli_distribution{p}(g_dev)) {
        a.rows_.push_back(row);
      }
    }
    return a;
  }

  // Gauss-Jordan elimination over F2: returns the length of the longest
  // prefix of bit positions for which a pivot exists, i.e. how many leading
  // bits of the decoded block we can control.
  std::size_t NumControllablePrefixBits() && {
    for (std::size_t j = 0; true; ++j) {
      const auto pivot = [&]() -> std::optional<std::size_t> {
        for (std::size_t i = j; i < rows_.size(); ++i) {
          if (rows_[i][j]) {
            return i;
          }
        }
        return std::nullopt;
      }();
      if (!pivot) {
        return j;
      }
      std::swap(rows_[j], rows_[*pivot]);
      for (std::size_t i = 0; i < rows_.size(); ++i) {
        if (i != j && rows_[i][j]) {
          rows_[i] ^= rows_[j];
        }
      }
    }
  }

 private:
  std::vector<Vector> rows_;
};

}  // namespace

int main() {
  static constexpr std::size_t kBlocks = 10000;
  static constexpr std::size_t kLogBlockSize = 6;
  static constexpr std::size_t kBlockSize = 1 << kLogBlockSize;
  std::size_t num_usable_bits = 0;
  const Matrix a = Matrix::Random(kBlockSize);
  for (int i = 0; i < kBlocks; ++i) {
    // Usable payload per block: controllable prefix minus the 6-bit length.
    num_usable_bits +=
        a.RandomSubsetOfRows(0.5).NumControllablePrefixBits() - kLogBlockSize;
  }
  std::cout << static_cast<double>(num_usable_bits) / (kBlocks * kBlockSize)
            << "\n";
}
```
Answer 2
Score: 2
I implemented the full algorithm in Python. I augment the matrix by an identity matrix, like a matrix inversion algorithm, but zero out the columns that correspond to holes.
```python
import math
import random

log_n = 10
n = log_n + ((1 << log_n) - 1)
# One random n-bit row per tape position; bit j of `data` set means a hole.
matrix = [random.randrange(1 << n) for j in range(n)]


def decode(data):
    # The decoded text is the XOR of the rows at all gap (0) positions,
    # i.e. the matrix-vector product A(1 - x).
    text = 0
    for j in range(n):
        if (data & (1 << j)) == 0:
            text ^= matrix[j]
    return text


def invert(existing_data):
    # Gauss-Jordan elimination over F2, restricted to the positions that are
    # still gaps.  inverse[i] is a mask of gap positions whose toggling flips
    # bit i of the decoded text; the returned prefix is the set of text bits
    # we can control.
    sub = [matrix[j] for j in range(n) if (existing_data & (1 << j)) == 0]
    inverse = [1 << j for j in range(n) if (existing_data & (1 << j)) == 0]
    i = 0
    while True:
        for k in range(i, len(sub)):
            if sub[k] & (1 << i):
                break
        else:
            return inverse[:i]
        sub[i], sub[k] = sub[k], sub[i]
        inverse[i], inverse[k] = inverse[k], inverse[i]
        for k in range(len(sub)):
            if k != i and (sub[k] & (1 << i)):
                sub[k] ^= sub[i]
                inverse[k] ^= inverse[i]
        i += 1


def encode(inverse, text):
    # Start from "punch everything" (all ones) and clear exactly the gaps
    # needed to make the decoded text come out right.
    data = ~((~0) << n)
    for i in range(len(inverse)):
        if text & (1 << i):
            data ^= inverse[i]
    return data


def test():
    existing_data = random.randrange(1 << n)
    inverse = invert(existing_data)
    # The first log_n text bits store the payload length so the decoder
    # knows how much of the block is real data.
    payload_size = max(len(inverse) - log_n, 0)
    payload = random.randrange(1 << payload_size)
    text = payload_size ^ (payload << log_n)
    data = encode(inverse, text)
    assert (existing_data & ~data) == 0  # every preexisting hole is still a hole
    decoded_text = decode(data)
    decoded_payload_size = decoded_text & (~((~0) << log_n))
    decoded_payload = (decoded_text >> log_n) & (~((~0) << payload_size))
    assert payload_size == decoded_payload_size
    assert payload == decoded_payload
    return payload_size / n


# Sum of the per-trial payload fractions; over 100 trials this reads as an
# approximate efficiency percentage.
print(sum(test() for i in range(100)))
```