How to understand the printed result of byte data read from a pickle file?
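Question

I am trying to get data out of a pickle file. As far as I know, serialization converts the data into a byte stream. When I read the file in binary mode with this code: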
f = open("alexnet.pth", "rb")
data = f.read()
I got this result:
> b'PK\x03\x04\x00\x00\x08\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10\x00\x12\x00archive/data.pklFB\x0e\x00ZZZZZZZZZZZZZZ\x80\x02ccollections\nOrderedDict\nq\x00)Rq\x01(X\x11\x00\x00\x00features.0.weightq\x02ctorch._utils\n_rebuild_tensor_v2\nq\x03((X\x07\x00\x00\x00storageq\x04ctorch\nFloatStorage\nq\x05X\r\x00\x00\x002472041505024q\x06X\x03\x00\x00\x00cpuq\x07M\xc0Ztq\x08QK\x00(K@K\x03K\x0bK\x0btq\t(Mk\x01KyK\x0bK\x01tq\n\x89h\x00)Rq\x0btq\x0cRq\rX\x0f\x00\x00\x00features.0.biasq\x0eh\x03((h\x04h\x05X\r\x00\x00\x002472041504928q\x0fh\x07K@tq\x10QK\x00K@\x85q\x11K\x01\x85q\x12\x89h\x00)Rq\x13tq\x14Rq\x15X\x11\x00\x00\x00features.3.weightq\x16h\x03((h\x04h\x05X\r\x00\x00\x002472041505120q\x17h\x07J\x00\xb0\x04\x00tq\x18QK\x00(K\xc0K@K\x05K\x05tq\x19(M@\x06K\x19K\x05K\x01tq\x1a\x89h\x00)Rq\x1btq\x1cRq\x1dX\x0f\x00\x00\x00features.3.biasq\x1eh\x03((h\x04h\x05X\r\x00\x00\x002472041507136q\x1fh\x07K\xc0tqQK\x00K\xc0\x85q!K\x01\x85q"...
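I know these are hexadecimal characters. My question is: does 1 byte contain 1 hexadecimal character (does every "\" mean 1 byte)? If not, how should I read this in terms of bytes? I also notice some English words such as "\x02ctorch._utils" and "n_rebuild_tensor_v2". What do they mean (hexadecimal + string)?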
# Answer 1
**Score**: 1
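> Does 1 byte contain 1 hexadecimal character (does every "\" mean 1 byte)?

Technically, 1 byte is a number between 0 and 255, which is usually written as two hexadecimal digits from 00 to FF, expressed in Python as \x00 to \xff. So yes, each "\x__" escape stands for one byte, but every 'normal' letter is one byte too. Python simply prints the ASCII character when the byte corresponds to a printable ASCII character (values 32-126), and an escape otherwise: a short escape such as \n, \r or \t for a few common control characters, and the '\x__' form for the remaining control characters and for values >= 128. But if a byte is printed as a character, that doesn't mean it was meant to be a character in the original data! (The readable function names surely are, though.)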
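As a small illustration of the point above, the sketch below takes the first four bytes of the dump in the question plus a made-up four-byte value (`sample`, which is not from the file) and shows that every element of a `bytes` object is one integer, however it happens to be printed.

```python
# First four bytes of the dump shown in the question.
header = b'PK\x03\x04'

# Four bytes, even though two print as letters and two as \x escapes.
byte_count = len(header)
values = list(header)      # each byte as an integer 0..255
as_hex = header.hex()      # two hex digits per byte

# `sample` is a made-up value: a printable byte, NUL, a byte >= 128, newline.
sample = bytes([65, 0, 255, 10])
shown = repr(sample)

print(byte_count)   # 4
print(values)       # [80, 75, 3, 4]
print(as_hex)       # 504b0304
print(shown)        # b'A\x00\xff\n'
```

So `P` and `K` are just the bytes 80 and 75, and `\n` in the output is a single byte with value 10, not a backslash followed by an `n`.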
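> How to read this in terms of bytes?

If you know what the bytes are supposed to represent (int16, int32, float, char, ASCII, UTF-8, ...), you can convert them with Python's [struct][1] module. Otherwise this representation is as good as any other.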
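A minimal sketch of such a conversion, using a few byte runs copied from the dump in the question. The interpretation of each run (little-endian integers that follow the pickle opcodes `X`, `M` and `J`) is an assumption based on the standard pickle protocol 2 encoding, not something the file itself announces.

```python
import struct

# 'X' is followed by a 4-byte little-endian length; here it precedes
# the 17-character string "features.0.weight" in the dump.
(name_len,) = struct.unpack('<I', b'\x11\x00\x00\x00')

# 'M' is followed by a 2-byte little-endian unsigned integer: b'\xc0Z'.
(count_a,) = struct.unpack('<H', b'\xc0Z')

# 'J' is followed by a 4-byte little-endian signed integer.
(count_b,) = struct.unpack('<i', b'\x00\xb0\x04\x00')

# The float32 value 1.0 ends in byte 0x3f, which prints as '?'.
one = struct.pack('<f', 1.0)

print(name_len)   # 17
print(count_a)    # 23232  (= 64 * 3 * 11 * 11)
print(count_b)    # 307200 (= 192 * 64 * 5 * 5)
print(one)        # b'\x00\x00\x80?'
```

The `Z` in `b'\xc0Z'` and the `?` in the packed float are examples of bytes that print as characters without being text.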
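> Also I notice there are some English words such as "\x02ctorch._utils" and "n_rebuild_tensor_v2". What do they mean (hexadecimal + string)?

As mentioned, these are just strings encoded in the data as ASCII (or UTF-8, which makes no difference in this case). The non-printable byte in front is probably part of the data that comes before; there is no way to know for sure without knowing this particular format.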
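Since the dump names `archive/data.pkl` and the stream starts with `\x80\x02`, it is reasonable to assume the format here is pickle protocol 2. Under that assumption, the standard-library `pickletools` module can split the bytes into opcodes and arguments. The sketch below uses a prefix copied from the question; the trailing `b'.'` (the STOP opcode) is appended only so that `genops()` terminates and is not part of the original data at that position.

```python
import pickletools

# Prefix of the pickle stream copied from the dump in the question.
# The final b'.' (STOP) is added here only so genops() can finish.
fragment = (
    b'\x80\x02ccollections\nOrderedDict\nq\x00)Rq\x01('
    b'X\x11\x00\x00\x00features.0.weightq\x02'
    b'ctorch._utils\n_rebuild_tensor_v2\nq\x03'
    b'.'
)

# genops() only tokenizes the stream; it does not import or run anything.
ops = [(op.name, arg) for op, arg, _pos in pickletools.genops(fragment)]

for name, arg in ops:
    print(name, repr(arg))
```

Read this way, `\x02` is the argument of the preceding `q` (BINPUT, "store in memo slot 2"), the `c` is the GLOBAL opcode, and the `\n` bytes separate the module name `torch._utils` from the attribute name `_rebuild_tensor_v2`. The leading `n` in "n_rebuild_tensor_v2" is therefore part of a newline escape, not part of the name.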
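As others have mentioned, there is not much to gain by poking around in this data. The code that writes these files is [here][2]. There is a lot of pickling and zipping going on, which mangles the original data even further.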
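To make the "zipping plus pickling" layering concrete, here is a self-contained sketch that builds a tiny stand-in archive in memory and reads it back. The entry name mirrors the `archive/data.pkl` visible in the dump, but the payload is made up, and this is not a reproduction of what `torch.save` actually writes.

```python
import io
import pickle
import zipfile

# Build a small stand-in archive in memory. The entry name mirrors the
# "archive/data.pkl" seen in the dump; the payload is hypothetical.
payload = {"features.0.bias": [0.0, 0.0]}
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("archive/data.pkl", pickle.dumps(payload, protocol=2))

raw = buffer.getvalue()
signature = raw[:4]  # ZIP local file header magic, as at the start of the dump

# Read it back: list the entries, then unpickle the one we wrote.
with zipfile.ZipFile(io.BytesIO(raw)) as archive:
    names = archive.namelist()
    restored = pickle.loads(archive.read("archive/data.pkl"))

print(signature)  # b'PK\x03\x04'
print(names)      # ['archive/data.pkl']
print(restored)   # {'features.0.bias': [0.0, 0.0]}
```

For a real `.pth` file, `zipfile` should be able to list the entries in the same way, but the `Q` opcodes in the dump suggest that `data.pkl` refers to tensor storages through persistent IDs, so a plain `pickle.loads` will most likely fail there and `torch.load` remains the practical way to get the tensors. As with any pickle, only load files from a source you trust.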
But it's always good to poke around!
[1]: https://docs.python.org/3/library/struct.html
[2]: https://pytorch.org/docs/stable/_modules/torch/serialization.html#save