How to convert an embedded .bin file from a Word document to its original .msg format?
Question
I'm writing Python code to extract the various files embedded in a Word document, but I can't work out how to restore an embedded Outlook .msg file to its original, usable .msg form after extracting it as an oleObject.bin file. Does anyone know how to do this?
Restoring PDF files is pretty straightforward, and the zipfile library has built-in tools for dealing with zip files in .bin form, but I'm really scratching my head over these .msg files: I can't find a way to carve the original file out of all the added binary data. Any help or thoughts on this would be appreciated!
I essentially want to do the same thing as this question but for .msg files instead of PDFs: https://stackoverflow.com/questions/56587082/how-can-i-decode-a-bin-into-a-pdf
Edit: This is the error I get when I try to just rename the file extension of the .bin to .msg
Answer 1
Score: 0
OLE objects, if correctly embedded (not linked), are simply the same as their source files. So you can open them in their application and save them from that application. Text, for example, will save from Notepad. A Zip does not need saving, because it is treated as a folder; it simply needs to be moved from its temporary location. A MSG can be saved from Outlook, if you trust it enough to open it.
If you don't have Outlook, it can be opened in Notepad too (but it will only be salvageable as plain text, plus RTF if included). Here we see the Fax Sample entry from Me to You with the complimentary message Hello World!
If we save the RTF, we can see the RTF body content in WordPad (and thus auto-print it to PDF using Write /PT ....)
If you want to pull out all the .bin files, use TAR -xf to unpack the .docx
hello - docx.zip\word\embeddings
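Since the question is about doing this from Python, the same unpacking step can be sketched with the standard-library zipfile module instead of TAR. This is a minimal sketch, not the answerer's method: the function name extract_embeddings and its arguments are hypothetical, and it assumes the embedded objects sit under word/embeddings/ as in the path shown above.

```python
import zipfile
from pathlib import Path


def extract_embeddings(docx_path, out_dir):
    """Copy every file stored under word/embeddings/ out of a .docx archive.

    Hypothetical helper for illustration; returns the paths it wrote.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # A .docx is an ordinary zip archive, so zipfile can read it directly.
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip directory entries and anything outside word/embeddings/.
            if name.startswith("word/embeddings/") and not name.endswith("/"):
                target = out_dir / Path(name).name
                target.write_bytes(archive.read(name))
                written.append(target)
    return written
```

Calling extract_embeddings("hello.docx", "bins") would leave files such as oleObject1.bin in the bins folder, still wrapped in the extra header and trailer data discussed below.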
These will include (as you observed from the other question) headers and trailers. Of course, you will not know which is which without looking inside and removing the header/trailer, but a Zip will start with PK.
A .MSG will start with the DocFile signature.
The start of a MSG file will be marked with ÐÏ à, which in hex should be something like D0 CF 11 E0, i.e. it's a "DocFile".
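The two signatures mentioned here can be checked from Python by looking at the first bytes of the data. This is only a sketch under the assumptions stated in the text (PK for a Zip, D0 CF 11 E0 for a DocFile); the name sniff_container is made up for illustration.

```python
ZIP_MAGIC = b"PK"                    # start of a zip archive
DOCFILE_MAGIC = b"\xd0\xcf\x11\xe0"  # the bytes that display as "ÐÏ à"


def sniff_container(data: bytes) -> str:
    """Guess the container type from the leading bytes (hypothetical helper)."""
    if data.startswith(ZIP_MAGIC):
        return "zip"
    if data.startswith(DOCFILE_MAGIC):
        return "docfile"
    return "unknown"
```

Note that an oleObject.bin may itself begin with the DocFile signature, so a "docfile" result on the raw .bin does not by itself mean the data is a ready-to-use .msg.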
The end of a MSG has 16-bit FEFF FFFF ... padding, so it ends with, say:
þÿÿÿýÿÿÿÿÿÿÿÿ ...lots more ÿÿ ... ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
The bin has more data, so the end of that block is dirty with a 16-bit filename and path:
ÿÿÿÿÿÿÿÿT C : \ U s e r s \ n a m e \ A p p D a t a \ L o c a l \ T e m p \ { A 0 9 5 A 1 6 4 - 2 B 3 6 - 4 9 0 5 - A 2 9 4 - E 5 B C C B 9 5 B 9 B 5 } \ H e l l o ( 2 ) . m s g H e l l o . m s g C : \ U s e r s \ n a m e \ D o c u m e n t s \ H e l l o . m s g
I'm unsure whether the T is significant in some cases or just buffer debris, so you need to check.
Answer 2
Score: 0
To close this out: as KJ stated, the actual .msg file content in the .bin file starts with the bytes \xd0\xcf\x11\xe0 (specifically, the second instance of that byte sequence).
I did some testing, and it looks like the footer padding that the .bin file adds at the end begins with [SomeRandomByte]\x00\x00\x00C\x00:. The first byte of that sequence appears to be variable, so I just delete it after removing everything else.
I was able to recover the contents by starting at the second \xd0\xcf\x11\xe0 sequence and chopping off everything from the [SomeRandomByte]\x00\x00\x00C\x00: sequence onward.
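The carving described above could look roughly like the following. It is a sketch of the heuristic as stated, not a parser for the OLE format: the name carve_msg is hypothetical, and cutting at the first footer match after the second signature is an assumption that would go wrong if the same byte pattern happened to occur inside the message itself.

```python
import re

DOCFILE_MAGIC = b"\xd0\xcf\x11\xe0"
# One variable byte, then 00 00 00, then "C", 00, ":" (the start of a
# 16-bit "C:" path), matching the footer pattern described above.
FOOTER_RE = re.compile(rb".\x00\x00\x00C\x00:", re.DOTALL)


def carve_msg(data: bytes) -> bytes:
    """Return the bytes from the second DocFile signature up to the footer."""
    first = data.find(DOCFILE_MAGIC)
    if first == -1:
        raise ValueError("no DocFile signature found")
    second = data.find(DOCFILE_MAGIC, first + 1)
    if second == -1:
        raise ValueError("no second DocFile signature found")
    # Assumption: the first footer match after the payload start marks the
    # end of the .msg data. If no footer is found, keep everything.
    footer = FOOTER_RE.search(data, second)
    end = footer.start() if footer else len(data)
    return data[second:end]
```

A typical use would read the extracted oleObject.bin with open(path, "rb"), pass the bytes to carve_msg, and write the result to a file ending in .msg. It is worth opening the result in Outlook or a hex viewer to confirm the cut points before relying on it.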