英文:
Compression algorithm for contiguous numbers
问题
I'm looking for an efficient encoding for storing simulated coefficients.
The data has thousands of curves with each 512 contiguous numbers with single precision. The data may be stored as fixed point while it should preserve about 23-bit precision (compared to unity level).
The curves could look like those:
My best approach was to convert the numbers to 24-bit fixed point. Repeatedly I took the adjacent difference as long as the sum-of-squares decreases. When compressing the resulting data using LZMA (xz,lzip) I get about 7.5x compression (compared to float32).
Adjacent differences are good at the beginning, but they emphasize the quantization noise at each turn.
I've also tried the cosine transform after subtracting the slope/curve at the boundaries. The resulting compression was much weaker.
I tried AEC but LZMA compressed much stronger. The highest compression was using bzip3 (after adjacent differences).
I found no function to fit the data with high precision and a limited parameter count.
Is there a way to reduce the penalty of quantization noise when using adjacent differences?
Are there encodings which are better suited for this type of data?
英文:
I'm looking for an efficient encoding for storing simulated coefficients.
The data has thousands of curves with each 512 contiguous numbers with single precision. The data may be stored as fixed point while it should preserve about 23-bit precision (compared to unity level).
The curves could look like those:
My best approach was to convert the numbers to 24-bit fixed point. Repeatedly I took the adjacent difference as long as the sum-of-squares decreases. When compressing the resulting data using LZMA (xz,lzip) I get about 7.5x compression (compared to float32).
Adjacent differences are good at the beginning, but they emphasize the quantization noise at each turn.
I've also tried the cosine transform after subtracting the slope/curve at the boundaries. The resulting compression was much weaker.
I tried AEC but LZMA compressed much stronger. The highest compression was using bzip3 (after adjacent differences).
I found no function to fit the data with high precision and a limited parameter count.
Is there a way to reduce the penalty of quantization noise when using adjacent differences?
Are there encodings which are better suited for this type of data?
答案1
得分: 2
你可以尝试使用高阶预测器。你的"邻近差异"是零阶预测器,其中下一个样本被预测为与上一个样本相等。你获取实际值和预测值之间的差异,然后压缩这些差异。
你可以尝试一阶、二阶等预测器。一阶预测器将查看最后两个样本,连接这两个样本,然后预测下一个样本将落在这条直线上。二阶预测器将查看最后三个样本,拟合它们成一个抛物线,并预测下一个样本将落在这个抛物线上。以此类推。
假设你的样本在x轴上等间隔分布,那么对于x[0]
到三阶预测器的预测如下:
x[-1]
(你现在正在使用的)2*x[-1] - x[-2]
3*x[-1] - 3*x[-2] + x[-3]
4*x[-1] - 6*x[-2] + 4*x[-3] - x[-4]
(请注意,系数是交替符号的二项式系数。)
我怀疑三阶多项式预测器对你来说可能不太有用,但可以尝试所有这些预测器,看看哪一个有帮助。
假设差异很小,你应该使用可变长度的整数来表示它们。想法是大多数时候使用一个字节来编码每个差异。例如,你可以在一个字节中编码七位差异,例如-64到63,高位清零。如果差异不适应这个范围,那么将设置高位,并使用第二个字节来编码另外七位,总共有14位,第二个高位清零。以此类推,对于更大的差异采用相应的编码方式。
英文:
You could try a higher-order predictor. Your "adjacent difference" is a zeroth-order predictor, where the next sample is predicted to be equal to the last sample. You take the differences between the actuals and the predictions, and then compress those differences.
You can try first, second, etc. order predictors. A first-order predictor would look at the last two samples, draw a line between those, and predict that the next sample will fall on the line. A second-order predictor would look at the last three samples, fit those to a parabola, and predict that the next sample will fall on the parabola. And so on.
Assuming that your samples are equally spaced on your x-axis, then the predictors for x[0]
up through cubics are:
x[-1]
(what you're using now)2*x[-1] - x[-2]
3*x[-1] - 3*x[-2] + x[-3]
4*x[-1] - 6*x[-2] + 4*x[-3] - x[-4]
(Note that the coefficients are alternating-sign binomial coefficients.)
I doubt that the cubic polynomial predictor will be useful for you, but experiment with all of them to see if any help.
Assuming that the differences are small, you should use a variable-length integer to represent them. The idea would be to use one byte for each difference most of the time. For example, you could code seven bits of difference, say -64 to 63, in one byte with the high bit clear. If the difference doesn't fit in that, then make the high bit set, and have a second byte with another seven bits for a total of 14 with that second high bit clear. And so on for larger differences.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论