Strange encoding of a PDF stream


Question


I'm studying the internal structure of PDF, so I created a file in LibreOffice Writer containing only the string "Hello world" and exported it to PDF. I then uncompressed it with `pdftk hello_world.pdf output hello_world_unc.pdf uncompress` and opened the result in a text editor.

Looking at the content stream, I get something strange like this: `[<01>5<02>-6<03>2<03>2<040506>-2 <040703>2<08>]TJ`, which should represent "Hello world" as an array of hexadecimal strings (in the angle brackets) with integers that specify the spacing.

The file contains only this string; I created it purely for educational purposes.

The problem is that these don't look like the hexadecimal character codes I expected. Surely "H" is not represented by 01.
I was expecting something like this: `(Hello world) Tj`.

Can anyone help me understand? Thanks in advance.

Answer 1

Score: 0

These numbers are just indexes into the character map.

Dig deeper into the uncompressed PDF and you will find lines like these:

```
<01> <0048>
<02> <0065>
<03> <006C>
<04> <006F>
<05> <0020>
<06> <0077>
<07> <0072>
<08> <0064>
```

Each line maps a code used in the content stream (left) to a Unicode code point (right), so `<01>` stands for U+0048, which is "H".
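To make the lookup concrete, here is a minimal Python sketch that decodes the TJ array from the question with the mapping quoted above. The names `TO_UNICODE` and `decode_tj` are made up for illustration, and the sketch assumes one-byte character codes (which matches the `<01>` ... `<08>` entries shown); a font with two-byte codes would need `code_bytes=2`.

```python
import re

# Mapping copied from the lines quoted in this answer: character code -> Unicode code point.
TO_UNICODE = {
    0x01: 0x0048, 0x02: 0x0065, 0x03: 0x006C, 0x04: 0x006F,
    0x05: 0x0020, 0x06: 0x0077, 0x07: 0x0072, 0x08: 0x0064,
}


def decode_tj(tj_array: str, to_unicode: dict, code_bytes: int = 1) -> str:
    """Decode the hex strings of a TJ array; the numeric spacing values are ignored."""
    chars = []
    # Only the <...> hex strings carry character codes.
    for hex_string in re.findall(r"<([0-9A-Fa-f\s]*)>", tj_array):
        digits = re.sub(r"\s+", "", hex_string)
        if len(digits) % 2:
            digits += "0"  # an odd-length PDF hex string is padded with a trailing 0
        raw = bytes.fromhex(digits)
        # Assumption: every code is `code_bytes` bytes long.
        for i in range(0, len(raw), code_bytes):
            code = int.from_bytes(raw[i:i + code_bytes], "big")
            # Unknown codes become U+FFFD (replacement character).
            chars.append(chr(to_unicode.get(code, 0xFFFD)))
    return "".join(chars)


text = decode_tj("[<01>5<02>-6<03>2<03>2<040506>-2 <040703>2<08>]TJ", TO_UNICODE)
print(text)  # Hello world
```

Note that a string such as `<040506>` holds three codes (04, 05, 06), which is why the sketch splits each hex string into fixed-size pieces before looking them up.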



# Answer 2
**Score**: 0

- Kerning is in use, so a TJ array is used instead of a Tj string. The numbers are kerning adjustments in thousandths of a text space unit, which works out to 1/1000 of an em once scaled by the font size; a positive value moves the next glyph to the left;

- The `<>` strings are PDF hex strings, not ordinary PDF strings;

- Look for a /ToUnicode map in the font. If it exists, it will help you map PDF character codes to sequences of Unicode code points.
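The points above can be sketched in Python. This is only an illustration under stated assumptions: `CMAP_TEXT` is a hypothetical excerpt built from the mapping lines in Answer 1 (a real /ToUnicode stream has more boilerplate and may also use `beginbfrange` sections, which are not handled here), the helper names `parse_bfchar` and `tj_shift` are made up, and `tj_shift` ignores horizontal scaling.

```python
import re

# Hypothetical /ToUnicode CMap excerpt, assembled from the mapping lines in Answer 1.
CMAP_TEXT = """
8 beginbfchar
<01> <0048>
<02> <0065>
<03> <006C>
<04> <006F>
<05> <0020>
<06> <0077>
<07> <0072>
<08> <0064>
endbfchar
"""


def parse_bfchar(cmap_text: str) -> dict:
    """Collect code -> Unicode string pairs from every beginbfchar/endbfchar section."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # The destination is UTF-16BE, so one code may map to several characters.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def tj_shift(adjustment: float, font_size: float) -> float:
    """Shift in text space units caused by a TJ number (positive moves the next glyph left)."""
    return -adjustment / 1000 * font_size


mapping = parse_bfchar(CMAP_TEXT)
codes = [1, 2, 3, 3, 4, 5, 6, 4, 7, 3, 8]  # codes from the TJ array in the question
print("".join(mapping[c] for c in codes))  # Hello world
```

For example, at a hypothetical font size of 12, the `5` after `<01>` in the question's array would give `tj_shift(5, 12)`, a shift of roughly 0.06 units to the left, which is far too small to notice by eye but is how the kerning is encoded.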



huangapple
  • Posted on 2023-04-01 00:20:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75900720.html