How to understand print result of byte data read from a pickle file?

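Question

I am trying to get data from a pickle file. As far as I know, serialization converts the data into a byte stream. When I read the file as binary using this code:

    with open("alexnet.pth", "rb") as f:
        data = f.read()

I got this result: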

> b'PK\x03\x04\x00\x00\x08\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10\x00\x12\x00archive/data.pklFB\x0e\x00ZZZZZZZZZZZZZZ\x80\x02ccollections\nOrderedDict\nq\x00)Rq\x01(X\x11\x00\x00\x00features.0.weightq\x02ctorch._utils\n_rebuild_tensor_v2\nq\x03((X\x07\x00\x00\x00storageq\x04ctorch\nFloatStorage\nq\x05X\r\x00\x00\x002472041505024q\x06X\x03\x00\x00\x00cpuq\x07M\xc0Ztq\x08QK\x00(K@K\x03K\x0bK\x0btq\t(Mk\x01KyK\x0bK\x01tq\n\x89h\x00)Rq\x0btq\x0cRq\rX\x0f\x00\x00\x00features.0.biasq\x0eh\x03((h\x04h\x05X\r\x00\x00\x002472041504928q\x0fh\x07K@tq\x10QK\x00K@\x85q\x11K\x01\x85q\x12\x89h\x00)Rq\x13tq\x14Rq\x15X\x11\x00\x00\x00features.3.weightq\x16h\x03((h\x04h\x05X\r\x00\x00\x002472041505120q\x17h\x07J\x00\xb0\x04\x00tq\x18QK\x00(K\xc0K@K\x05K\x05tq\x19(M@\x06K\x19K\x05K\x01tq\x1a\x89h\x00)Rq\x1btq\x1cRq\x1dX\x0f\x00\x00\x00features.3.biasq\x1eh\x03((h\x04h\x05X\r\x00\x00\x002472041507136q\x1fh\x07K\xc0tqQK\x00K\xc0\x85q!K\x01\x85q"...

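I know those are hexadecimal characters. My question is: does 1 byte contain 1 hexadecimal character (does every "\\" mean 1 byte)? Or how should I read this in terms of bytes? I also notice some English words such as "\x02ctorch._utils" and "n_rebuild_tensor_v2". What do they mean (hexadecimal + string)?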

<details>
<summary>English original:</summary>

I am trying to get data from a pickle file. As far as I know, when we do serialization, the data is converted into a byte stream. When I read the data as binary using this code:

    f = open("alexnet.pth", "rb")
    data = f.read()

I got this result:

> b'PK\x03\x04\x00\x00\x08\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10\x00\x12\x00archive/data.pklFB\x0e\x00ZZZZZZZZZZZZZZ\x80\x02ccollections\nOrderedDict\nq\x00)Rq\x01(X\x11\x00\x00\x00features.0.weightq\x02ctorch._utils\n_rebuild_tensor_v2\nq\x03((X\x07\x00\x00\x00storageq\x04ctorch\nFloatStorage\nq\x05X\r\x00\x00\x002472041505024q\x06X\x03\x00\x00\x00cpuq\x07M\xc0Ztq\x08QK\x00(K@K\x03K\x0bK\x0btq\t(Mk\x01KyK\x0bK\x01tq\n\x89h\x00)Rq\x0btq\x0cRq\rX\x0f\x00\x00\x00features.0.biasq\x0eh\x03((h\x04h\x05X\r\x00\x00\x002472041504928q\x0fh\x07K@tq\x10QK\x00K@\x85q\x11K\x01\x85q\x12\x89h\x00)Rq\x13tq\x14Rq\x15X\x11\x00\x00\x00features.3.weightq\x16h\x03((h\x04h\x05X\r\x00\x00\x002472041505120q\x17h\x07J\x00\xb0\x04\x00tq\x18QK\x00(K\xc0K@K\x05K\x05tq\x19(M@\x06K\x19K\x05K\x01tq\x1a\x89h\x00)Rq\x1btq\x1cRq\x1dX\x0f\x00\x00\x00features.3.biasq\x1eh\x03((h\x04h\x05X\r\x00\x00\x002472041507136q\x1fh\x07K\xc0tqQK\x00K\xc0\x85q!K\x01\x85q"\x89h\x00)Rq#tq$Rq%X\x11\x00\x00\x00features.6.weightq&h\x03((h\x04h\x05X\r\x00\x00\x002472041509056q\'h\x07J\x00 \n\x00tq(QK\x00(M\x80\x01K\xc0K\x03K\x03tq)(M\xc0\x06K\tK\x03K\x01tq*\x89h\x00)Rq+tq,Rq-X\x0f\x00\x00\x00features.6.biasq.h\x03((h\x04h\x05X\r\x00\x00\x002472041505312q/h\x07M\x80\x01tq0QK\x00M\x80\x01\x85q1K\x01\x85q2\x89h\x00)Rq3tq4Rq5X\x11\x00\x00\x00features.8.weightq6h\x03((h\x04h\x05X\r\x00\x00\x002472041508192q7h\x07J\x00\x80\r\x00tq8QK\x00(M\x00\x01M\x80\x01K\x03K\x03tq9(M\x80\rK\tK\x03K\x01tq:\x89h\x00)Rq;tq<Rq=X\x0f\x00\x00\x00features.8.biasq>h\x03((h\x04h\x05X\r\


I know those are hexadecimal characters. My question is: does 1 byte contain 1 hexadecimal character (every "\\" means 1 byte)? Or how to read this in terms of bytes? Also, I notice there are some English words such as "\x02ctorch._utils" and "n_rebuild_tensor_v2". What do they mean (hexadecimal + string)?

</details>


# Answer 1
**Score**: 1

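> Does 1 byte contain 1 hexadecimal character (does every "\\" mean 1 byte)?

Technically, 1 byte can be represented by a number between 0 and 255, which is usually written as two hexadecimal characters from 00 to FF, expressed in Python as \x00 to \xFF. So yes, in a sense every "\\" marks one byte, but every 'normal' letter is one byte too. Python simply prints the ASCII character when the byte corresponds to a printable ASCII character (values 32-126) and falls back to the '\x__' representation when it does not (an ASCII control character, or a value >= 128); a few control bytes get short escapes such as \n, \r, and \t instead. But a byte that is printed as a character was not necessarily meant to be a character in the original data! (The readable function names surely are, though.)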
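To make the one-byte-per-item point concrete, here is a minimal sketch. The short `data` value is a made-up sample modelled on the first bytes of the dump in the question, not the actual file contents.

```python
# A short sample shaped like the start of the dump: two printable bytes,
# three non-printable bytes, then seven printable bytes.
data = b"PK\x03\x04\x00archive"

# Each "\xNN" escape and each plain letter counts as exactly one byte.
print(len(data))          # 12

# Indexing or iterating a bytes object yields integers in the range 0..255.
first_five = list(data[:5])
print(first_five)         # [80, 75, 3, 4, 0]

# The same five bytes written as two hexadecimal digits per byte.
hex_text = data[:5].hex()
print(hex_text)           # 504b030400

# repr() picks the display form per byte: a letter when printable,
# a short escape for newline, and \xNN otherwise.
shown = repr(bytes([0x41, 0x0A, 0x80]))
print(shown)              # b'A\n\x80'
```

So `P` and `\x03` are both single bytes; only the way Python displays them differs.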

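> How to read this in terms of bytes?

If you know what the bytes are supposed to represent (int16, int32, float, char, ASCII, UTF-8, ...), you can convert them with Python's [struct][1] module. Otherwise this representation is as good as any other.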
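As a small sketch of the `struct` approach, the byte runs below are copied from the dump in the question. The interpretation is an assumption: it reads the four bytes after `X` as a little-endian unsigned 32-bit length and the two bytes after `M` as a little-endian unsigned 16-bit integer, which is how pickle protocol 2 encodes those fields. The variable names are only illustrative.

```python
import struct

# Byte runs copied from the dump in the question.
length_field = b"\x11\x00\x00\x00"   # the 4 bytes right after the first "X"
name = b"features.0.weight"          # the text that follows those 4 bytes
count_field = b"\xc0Z"               # the 2 bytes right after "M" in "M\xc0Ztq"

# "<I" = little-endian unsigned 32-bit int, "<H" = little-endian unsigned 16-bit int.
(length,) = struct.unpack("<I", length_field)
(count,) = struct.unpack("<H", count_field)

print(length, len(name))   # 17 17
print(count)               # 23232
print(64 * 3 * 11 * 11)    # 23232
```

Under this reading, the length field matches the 17 characters of the name that follows it, and the 16-bit value equals the product of the four small integers `K@`, `K\x03`, `K\x0b`, `K\x0b` (64, 3, 11, 11) that appear shortly afterwards in the dump. Note that the `Z` in `\xc0Z` is the byte 0x5A used as part of a number, an instance of a printable byte that is not text.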

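> I also notice some English words such as "\x02ctorch._utils" and "n_rebuild_tensor_v2". What do they mean (hexadecimal + string)?

As mentioned, these are simply those strings, stored in the data as ASCII (or UTF-8, which makes no difference here). The non-printable byte in front is probably part of whatever data comes before the string; there is no way to know for sure without knowing this particular format.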
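For the pickled part of the data, the standard library's `pickletools` module can label the bytes. The sketch below pickles an empty `OrderedDict` with protocol 2 as a stand-in (an assumption chosen because the dump contains `\x80\x02ccollections\nOrderedDict`; it is not the real file) and lists the opcodes.

```python
import pickle
import pickletools
from collections import OrderedDict

# Stand-in payload: an empty OrderedDict pickled with protocol 2.
payload = pickle.dumps(OrderedDict(), protocol=2)
print(payload)

# genops() walks the stream and yields (opcode, argument, byte position).
ops = [(op.name, arg, pos) for op, arg, pos in pickletools.genops(payload)]
for name, arg, pos in ops:
    print(pos, name, repr(arg))

names = [name for name, _, _ in ops]
```

Read this way, the `c` in front of `torch._utils` would be the GLOBAL opcode rather than part of the module name, the `\x02` before it would belong to the preceding `q\x02` (BINPUT, a memo slot number), and `\n` would be a newline separating the module name from `_rebuild_tensor_v2`. That interpretation assumes the dump follows the same opcode layout as the stand-in.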

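As others have mentioned, there is not much to gain by poking around in this data. The code that writes these files is [here][2]. There is a lot of pickling and zipping going on, which mangles the original data even further.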
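The zipping and pickling layers can be peeled apart with the standard library. The sketch below builds a tiny stand-in archive in memory instead of opening `alexnet.pth`; the member name `archive/data.pkl` is taken from the dump, while the dictionary contents are invented for illustration and this is not how PyTorch itself writes the file.

```python
import io
import pickle
import zipfile

# Build a small stand-in archive in memory (hypothetical contents).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("archive/data.pkl",
                pickle.dumps({"features.0.bias": [0.0, 1.0]}, protocol=2))
raw = buf.getvalue()

# A ZIP archive starts with the same four bytes seen at the top of the dump.
print(raw[:4])                              # b'PK\x03\x04'
is_zip = zipfile.is_zipfile(io.BytesIO(raw))

# List the members, then unpickle the one that holds the pickled object.
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    members = zf.namelist()
    restored = pickle.loads(zf.read("archive/data.pkl"))

print(members)                              # ['archive/data.pkl']
print(restored)
```

A real file saved by PyTorch refers to tensor storages and torch classes inside the pickle, so plain `pickle.loads` is not expected to work on it; loading it through `torch.load` is the usual route. Only unpickle data from sources you trust.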

But it's always good to poke around!

[1]: https://docs.python.org/3/library/struct.html
[2]: https://pytorch.org/docs/stable/_modules/torch/serialization.html#save

<details>
<summary>English original:</summary>

> does 1 byte contain 1 hexadecimal character (every "\\" means 1 byte)?

Technically, 1 byte can be represented by a number between 0 and 255, which is often represented by two hexadecimal characters from 00 to FF, expressed in Python as \x00 to \xFF. So yes, in a sense every "\\" means one byte, but every 'normal' letter is a byte too. Python just chooses to print the ASCII character if the byte corresponds to a printable character in ASCII (numbers 32-126), and the '\x__'-representation if it doesn't ('ASCII control character' or >=128). But if a byte is printed as a character, that doesn't mean it was meant to be a character in the original data! (Although the readable function names surely are.)

> How to read this in terms of byte?

If you know what the byte is supposed to represent (int16, int32, float, char, ascii, utf-8, ...), you can convert them with Python's [struct][1] module. Otherwise this representation is as good as any other.

> Also I notice there are some English words such as "\x02ctorch._utils" and "n_rebuild_tensor_v2". What do they mean (hexadecimal + string)?

As mentioned, these are just these strings encoded in the data as ASCII (or UTF-8, no difference in this case). The non-printable byte in front is probably part of the data that comes before; there is no way to know for sure without knowing this particular format.

As others have mentioned, there is not much to gain here by poking around this data. The code that writes these files is [here][2]. There is a lot of pickling and zipping going on, which mangles the original data even further.

But it's always good to poke around!


  [1]: https://docs.python.org/3/library/struct.html
  [2]: https://pytorch.org/docs/stable/_modules/torch/serialization.html#save

</details>


