How to convert an embedded .bin file from a Word document to its original .msg format?

Question

I'm currently putting together some code to extract a variety of files that are embedded in a Word document using Python, but I'm having particular trouble figuring out how to restore an embedded Outlook .msg file back to its original (usable) .msg form after extracting it as an oleObject.bin file. Does anyone have an idea how to do this?

It's pretty straightforward to restore PDF files, and the zipfile library has built-in tools to deal with zip files in .bin form, but I'm really scratching my head over these .msg files. I can't find a way to carve the original file out of all the added binary data. Any help or thoughts on this would be appreciated!

I essentially want to do the same thing as this question but for .msg files instead of PDFs: https://stackoverflow.com/questions/56587082/how-can-i-decode-a-bin-into-a-pdf

Edit: This is the error I get when I try to just rename the file extension of the .bin to .msg

Answer 1

Score: 0

OLE objects that are correctly embedded (not linked) are simply the same as their source files, so you can open them in their own application and save them from there. A text file can be saved from Notepad. A Zip does not need saving because it is treated as a folder, so it only needs to be moved from its temporary location. An MSG can be saved from Outlook, if you trust it enough to open it.

If you don't have Outlook, the MSG can be opened in Notepad too (but only the plain text, and the RTF if included, will be salvageable). Here we see the Fax Sample entry from Me to You with the complimentary message Hello World!

If we save the RTF, we can see the RTF body content in WordPad (and thus auto-print it to PDF using Write /PT ....)

If you want to pull out all the .bin files, use TAR -xf to unpack the .docx. They are found under:

hello - docx.zip\word\embeddings
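Since the question is about Python, the same unpacking step can be sketched with the standard zipfile module instead of TAR. This is only a minimal sketch: the function name extract_embeddings and the output directory are illustrative assumptions, and it relies on the embedded objects living under word/embeddings/ as shown above.

```python
import zipfile
from pathlib import Path


def extract_embeddings(docx_path, out_dir):
    """Copy every file stored under word/embeddings/ out of a .docx archive.

    Hypothetical helper: the name and signature are illustrative only.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip directory entries; keep only real files in the embeddings folder.
            if name.startswith("word/embeddings/") and not name.endswith("/"):
                target = out_dir / Path(name).name
                target.write_bytes(archive.read(name))
                extracted.append(target)
    return extracted
```

Calling extract_embeddings("hello.docx", "bins") would leave files such as oleObject1.bin in the bins folder, still wrapped in their header and trailer.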

As you observed from the other question, these will include headers and trailers. Of course, you will not know which is which without looking inside and removing the header/trailer, but a Zip will start with PK

A .MSG will start with the DOC signature

The start of an MSG file will be marked with ÐÏ à, which in hex is D0 CF 11 E0, i.e. it is a "DocFile"
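To tell the extracted .bin files apart, the two signatures mentioned here can be searched for directly. The sketch below is an assumption-laden illustration (the helper name sniff is hypothetical): it simply lists every offset at which PK or D0 CF 11 E0 occurs, and because PK is only two bytes long, stray matches inside unrelated binary data are quite possible.

```python
ZIP_MAGIC = b"PK"                # start of a Zip archive
OLE_MAGIC = b"\xd0\xcf\x11\xe0"  # start of a "DocFile", which includes .msg


def sniff(data):
    """Return every offset at which each known signature occurs in data.

    Hypothetical helper for inspection only; it does not validate the files.
    """
    hits = {}
    for label, magic in (("zip", ZIP_MAGIC), ("ole", OLE_MAGIC)):
        offsets = []
        pos = data.find(magic)
        while pos != -1:
            offsets.append(pos)
            pos = data.find(magic, pos + 1)
        hits[label] = offsets
    return hits
```

For an embedded .msg, the "ole" list would be expected to hold at least two offsets: one for the wrapper and one for the message itself.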

The end of an MSG has 16-bit FEFF FFFF ... padding, so it ends with something like
þÿÿÿýÿÿÿÿÿÿÿÿ ...lots more ÿÿ ... ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
The .bin has more data after that, so the end of the block is dirty with a 16-bit filename and path
ÿÿÿÿÿÿÿÿT C : \ U s e r s \ n a m e \ A p p D a t a \ L o c a l \ T e m p \ { A 0 9 5 A 1 6 4 - 2 B 3 6 - 4 9 0 5 - A 2 9 4 - E 5 B C C B 9 5 B 9 B 5 } \ H e l l o ( 2 ) . m s g H e l l o . m s g C : \ U s e r s \ n a m e \ D o c u m e n t s \ H e l l o . m s g

I am unsure whether the T is significant in some cases or just buffer debris, so you need to check.

Answer 2

Score: 0

To close this out, as KJ stated, the actual .msg file content in the .bin file will start with the bytes \xd0\xcf\x11\xe0 (specifically the second instance of that sequence of bytes).

I did some testing, and it looks like the footer padding added by the .bin file at the end begins with [SomeRandomByte]\x00\x00\x00C\x00:. The first byte of that sequence appears to be variable, so I just delete it after removing everything else.

I was able to find the contents by starting with the second \xd0\xcf\x11\xe0 sequence and ending by chopping off everything after and including the [SomeRandomByte]\x00\x00\x00C\x00: sequence.
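The carving described above could be sketched roughly as follows. This is not the exact code I used: the names carve_msg and FOOTER are illustrative, and it assumes the first footer-like match after the second signature is the real trailer. If the message itself happens to contain a byte run of that shape, the result will be cut short, so check the output by opening it.

```python
import re

OLE_MAGIC = b"\xd0\xcf\x11\xe0"
# One variable byte, three NUL bytes, then "C", NUL, ":" (start of the wide path).
FOOTER = re.compile(rb".\x00\x00\x00C\x00:", re.DOTALL)


def carve_msg(blob):
    """Return the bytes from the second OLE signature up to the trailer.

    Hypothetical helper; raises ValueError if no second signature exists.
    """
    first = blob.find(OLE_MAGIC)
    start = blob.find(OLE_MAGIC, first + 1) if first != -1 else -1
    if start == -1:
        raise ValueError("fewer than two OLE signatures found")
    footer = FOOTER.search(blob, start)
    # If no trailer is found, keep everything to the end of the data.
    end = footer.start() if footer else len(blob)
    return blob[start:end]
```

A typical use would be to read oleObject1.bin in binary mode, pass the bytes to carve_msg, and write the result to a file with a .msg extension (both file names here are placeholders).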

huangapple
  • Posted on June 9, 2023 at 10:00:15
  • Source link: https://go.coder-hub.com/76436736.html