How to detect additional MIME types in Golang
Question
The net/http package has an http.DetectContentType([]byte) function, but it supports only a limited number of types. How can I add support for docx, doc, xls, xlsx, ppt, pps, odt, ods, and odp files, detecting them by content rather than by extension?
As far as I know, this is problematic because docx/xlsx/pptx/odp/odt files have the same signature as a zip file (50 4B 03 04).
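To make the problem concrete, here is a minimal sketch of what the standard library reports for the shared signature. The helper name sniff is only an illustration, not part of net/http.

```go
package main

import (
	"fmt"
	"net/http"
)

// sniff returns the MIME type that net/http infers from the leading bytes.
// (sniff is a hypothetical helper name used only for this example.)
func sniff(data []byte) string {
	return http.DetectContentType(data)
}

func main() {
	// Every ZIP-based container (docx, xlsx, pptx, odt, ...) starts with these bytes,
	// so the standard sniffer cannot tell them apart.
	zipHeader := []byte{0x50, 0x4B, 0x03, 0x04}
	fmt.Println(sniff(zipHeader)) // application/zip
}
```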
Answer 1
Score: 7
Disclaimer: I'm the author of mimetype.
For anyone having the same problem 3 years later, these are the current packages for content-based MIME type detection:
- filetype
  - pure Go, no C bindings
  - can be extended to detect new MIME types
  - has issues with files that match more than one MIME type (e.g. xlsx and docx passing as zip), because it stores matching functions in a map and therefore does not guarantee the order of traversal
  - limited number of detected MIME types
- magicmime
  - needs libmagic-dev installed
  - of the 3, it detects the highest number of MIME types
  - can be extended, albeit with more difficulty (see man magic)
  - libmagic is not thread safe
- mimetype
  - pure Go, no C bindings
  - detects more MIME types than filetype
  - is thread safe
  - can be extended
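The traversal-order issue mentioned above can be illustrated with a small sketch. It is not code from any of the three packages. The matcher type, the detect function, and the crude "word/" substring check are all assumptions made for the example. The point is only that an ordered slice, with specific formats listed before generic ones, gives a deterministic answer where a map would not.

```go
package main

import (
	"bytes"
	"fmt"
)

// matcher pairs a MIME type with a predicate over the file's leading bytes.
// (Hypothetical type for this sketch.)
type matcher struct {
	mime  string
	match func([]byte) bool
}

var zipMagic = []byte("PK\x03\x04")

// matchers is ordered from most specific to most generic. Ranging over a
// slice is deterministic, unlike ranging over a map.
var matchers = []matcher{
	{"application/vnd.openxmlformats-officedocument.wordprocessingml.document", func(b []byte) bool {
		// Crude stand-in heuristic: a ZIP whose header area names a word/ entry.
		return bytes.HasPrefix(b, zipMagic) && bytes.Contains(b, []byte("word/"))
	}},
	{"application/zip", func(b []byte) bool { return bytes.HasPrefix(b, zipMagic) }},
}

// detect returns the MIME type of the first matcher that accepts b.
func detect(b []byte) string {
	for _, m := range matchers {
		if m.match(b) {
			return m.mime
		}
	}
	return "application/octet-stream"
}

func main() {
	docxLike := append(append([]byte{}, zipMagic...), []byte("....word/document.xml")...)
	plainZip := append(append([]byte{}, zipMagic...), []byte("....notes.txt")...)
	fmt.Println(detect(docxLike))
	fmt.Println(detect(plainZip))
}
```

If the two matchers lived in a map, either one could be tried first for the docx-like input, so the result could change between runs.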
Answer 2
Score: 2
Files whose extension ends in x are relatively easy to detect. Unzip the file and read _rels/.rels. It contains the path to the main part of the document, identified by the relationship type http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument. Just check the name of its target: it is document.xml for docx, workbook.xml for xlsx, and presentation.xml for pptx.
More information can be found in ECMA-376.
Binary formats are harder to detect. Basically, you need to read the MS-CFB file system and check for these entries:
- WordDocument for doc
- Workbook or Book for xls
- PowerPoint Document for ppt
- EncryptedPackage means the file is encrypted
Answer 3
Score: 1
There's currently no way to extend http.DetectContentType as it uses a fixed, unexported slice of "sniffers": https://golang.org/src/net/http/sniff.go (sniffSignatures on line 49 at the time of writing).
Also, I looked quickly through godoc.org in search of a better package but didn't find any that is extensible and content-oriented as you require.
My advice would be: build your own package, guided by Go's content sniffer implementation (which follows https://mimesniff.spec.whatwg.org/).
Edit: If you're willing to use CGO and you're on *nix, you could use libmagic bindings such as https://github.com/jteeuwen/magic.
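One way to start on such a package, sketched under assumptions: check your own signature table first and fall back to http.DetectContentType for everything else. The names signature, extraSignatures, and detectContentType are hypothetical, and the MIME string chosen for the compound-file header is an illustrative label rather than an official registration.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// signature maps a magic-byte prefix to a MIME type.
type signature struct {
	magic []byte
	mime  string
}

// extraSignatures lists types the standard sniffer does not know about.
// The entry below is the 8-byte header of the compound file format used
// by legacy doc/xls/ppt files.
var extraSignatures = []signature{
	{[]byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}, "application/x-ole-storage"},
}

// detectContentType tries the custom signatures first, then defers to net/http.
func detectContentType(data []byte) string {
	for _, s := range extraSignatures {
		if bytes.HasPrefix(data, s.magic) {
			return s.mime
		}
	}
	return http.DetectContentType(data)
}

func main() {
	ole := []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1, 0x00}
	fmt.Println(detectContentType(ole))                // application/x-ole-storage
	fmt.Println(detectContentType([]byte("%PDF-1.7"))) // application/pdf
}
```

A prefix match only identifies the container. Telling doc from xls from ppt still requires reading the directory entries inside it.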
Answer 4
Score: 1
I found mimemagic, which I think is better than magicmime because it does not use cgo. However, magicmime is better at distinguishing application/zip from office file types.