How to detect additional MIME types in Golang
Question
The net/http package has an http.DetectContentType([]byte) function, but it supports only a limited number of types. How can I add support for docx, doc, xls, xlsx, ppt, pps, odt, ods, and odp files, detecting them by content rather than by extension?
As far as I know this is problematic, because docx/xlsx/pptx/odp/odt files have the same signature as a zip file (50 4B 03 04).
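To make the problem concrete, here is a minimal sketch of what the standard library reports for that shared signature. The sniff helper is a hypothetical wrapper introduced only for illustration; http.DetectContentType is the real function.

```go
package main

import (
	"fmt"
	"net/http"
)

// sniff is a hypothetical helper that simply forwards to the standard library.
func sniff(data []byte) string {
	return http.DetectContentType(data)
}

func main() {
	// Local file header signature shared by zip, docx, xlsx, pptx, odt, odp, ...
	zipHeader := []byte{0x50, 0x4B, 0x03, 0x04}
	fmt.Println(sniff(zipHeader)) // application/zip
}
```

Every zip-based Office format starts with these four bytes, so the prefix alone cannot tell them apart.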
Answer 1
Score: 7
Disclaimer: I'm the author of mimetype.
For anyone having the same problem 3 years later, these are the packages now available for content-based MIME type detection:
- filetype
  - pure Go, no C bindings
  - can be extended to detect new MIME types
  - has issues with files that match more than one MIME type (e.g. xlsx and docx passing as zip), because it stores its matching functions in a map and so does not guarantee the order of traversal
  - limited number of detected MIME types
- magicmime
  - needs libmagic-dev installed
  - detects the highest number of MIME types of the 3
  - can be extended, albeit with more difficulty (see man magic)
  - libmagic is not thread safe
- mimetype
  - pure Go, no C bindings
  - detects more MIME types than filetype
  - is thread safe
  - can be extended
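The ordering issue mentioned for map-based matchers can be sketched with the standard library alone. This is an illustration of the idea, not the code of any of the packages above: the matcher type, zipWith, and detect are hypothetical names, and the byte scan for entry names is a deliberately crude placeholder for real format checks.

```go
package main

import (
	"bytes"
	"fmt"
)

// matcher pairs a MIME type with a test function (hypothetical type).
type matcher struct {
	mime  string
	match func([]byte) bool
}

func isZip(b []byte) bool { return bytes.HasPrefix(b, []byte("PK\x03\x04")) }

// zipWith matches zip data that mentions the given entry name anywhere.
// Entry names are stored uncompressed in a zip, so a raw scan can find them,
// but this is only a rough stand-in for proper parsing.
func zipWith(name string) func([]byte) bool {
	return func(b []byte) bool { return isZip(b) && bytes.Contains(b, []byte(name)) }
}

// A slice keeps the order fixed: specific formats first, generic zip last.
// Ranging over a map instead could try the zip matcher first on some runs.
var matchers = []matcher{
	{"application/vnd.openxmlformats-officedocument.wordprocessingml.document", zipWith("word/document.xml")},
	{"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", zipWith("xl/workbook.xml")},
	{"application/zip", isZip},
}

func detect(b []byte) string {
	for _, m := range matchers {
		if m.match(b) {
			return m.mime
		}
	}
	return "application/octet-stream"
}

func main() {
	sample := []byte("PK\x03\x04 ... word/document.xml ...")
	fmt.Println(detect(sample))
}
```

Because the generic zip matcher sits last, a file that satisfies both a specific matcher and the zip matcher always resolves to the more specific type.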
Answer 2
Score: 2
Files with an x at the end of the extension are relatively easy to detect. Just unzip the file and read _rels/.rels. It contains the path to the main part of the document, identified by the relationship type http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument. Just check the name of that target: it's document.xml for docx, workbook.xml for xlsx, and presentation.xml for pptx.
More info can be found in ECMA-376.
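The steps above can be sketched with archive/zip and encoding/xml. This is a minimal sketch rather than a complete detector: detectOOXML and sampleDocx are hypothetical names, the sample archive contains only the relationships part, and anything that is a zip without a recognised main part is reported as application/zip.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/xml"
	"fmt"
	"path"
)

const officeDocRel = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"

// relationships mirrors the parts of _rels/.rels that are needed here.
type relationships struct {
	Items []struct {
		Type   string `xml:"Type,attr"`
		Target string `xml:"Target,attr"`
	} `xml:"Relationship"`
}

// detectOOXML opens data as a zip, reads _rels/.rels and maps the name of
// the main document part to a MIME type.
func detectOOXML(data []byte) (string, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return "", err
	}
	for _, f := range zr.File {
		if f.Name != "_rels/.rels" {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return "", err
		}
		defer rc.Close()
		var rels relationships
		if err := xml.NewDecoder(rc).Decode(&rels); err != nil {
			return "", err
		}
		for _, r := range rels.Items {
			if r.Type != officeDocRel {
				continue
			}
			switch path.Base(r.Target) {
			case "document.xml":
				return "application/vnd.openxmlformats-officedocument.wordprocessingml.document", nil
			case "workbook.xml":
				return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", nil
			case "presentation.xml":
				return "application/vnd.openxmlformats-officedocument.presentationml.presentation", nil
			}
		}
	}
	return "application/zip", nil
}

// sampleDocx builds a tiny in-memory archive that has only _rels/.rels.
func sampleDocx() []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, _ := zw.Create("_rels/.rels")
	fmt.Fprintf(w, `<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">`+
		`<Relationship Id="rId1" Type="%s" Target="word/document.xml"/></Relationships>`, officeDocRel)
	zw.Close()
	return buf.Bytes()
}

func main() {
	fmt.Println(detectOOXML(sampleDocx()))
}
```

Note that this needs the whole file, not just the first few hundred bytes, because the zip central directory sits at the end of the archive.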
Binary formats are harder to detect. Basically, you need to read the MS-CFB file system and check for these entries:
- WordDocument for doc
- Workbook or Book for xls
- PowerPoint Document for ppt
- EncryptedPackage means the file is encrypted
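A rough sketch of that directory check, using only the standard library, might look like the following. It rests on several assumptions: only the 109 FAT sector locations stored in the header are used (so very large files are not handled), every directory entry is listed flat without walking the storage tree (so an embedded object can influence the result), and the function names cfbEntryNames and classifyCFB are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

var cfbMagic = []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}

const freeSector, endOfChain = 0xFFFFFFFF, 0xFFFFFFFE

var errBadCFB = fmt.Errorf("not a valid CFB file")

// cfbEntryNames lists the names of all directory entries in a CFB file.
func cfbEntryNames(data []byte) ([]string, error) {
	if len(data) < 512 || !bytes.HasPrefix(data, cfbMagic) {
		return nil, errBadCFB
	}
	le := binary.LittleEndian
	shift := le.Uint16(data[30:]) // 9 = 512-byte sectors, 12 = 4096-byte sectors
	if shift != 9 && shift != 12 {
		return nil, errBadCFB
	}
	size := uint64(1) << shift
	sector := func(id uint32) []byte {
		start := (uint64(id) + 1) * size // sector 0 follows the header
		if start+size > uint64(len(data)) {
			return nil
		}
		return data[start : start+size]
	}
	// Build the FAT from the sector locations stored in the header (offset 76).
	var fat []uint32
	for i := 0; i < 109; i++ {
		id := le.Uint32(data[76+4*i:])
		if id == freeSector {
			break
		}
		s := sector(id)
		if s == nil {
			return nil, errBadCFB
		}
		for j := 0; j+4 <= len(s); j += 4 {
			fat = append(fat, le.Uint32(s[j:]))
		}
	}
	// Follow the directory sector chain; the first sector id is at offset 48.
	var names []string
	steps := 0
	for id := le.Uint32(data[48:]); id != endOfChain; id = fat[id] {
		s := sector(id)
		if s == nil || uint64(id) >= uint64(len(fat)) || steps > len(fat) {
			return nil, errBadCFB
		}
		steps++
		for off := 0; off+128 <= len(s); off += 128 {
			entry := s[off : off+128]
			n := int(le.Uint16(entry[64:])) // name length in bytes, with terminator
			if n < 2 || n > 64 {
				continue // unused or malformed entry
			}
			units := make([]uint16, 0, 32)
			for k := 0; k+2 <= n-2; k += 2 {
				units = append(units, le.Uint16(entry[k:]))
			}
			names = append(names, string(utf16.Decode(units)))
		}
	}
	return names, nil
}

// classifyCFB maps the entry names listed in the answer to a format label.
func classifyCFB(names []string) string {
	kinds := map[string]string{"WordDocument": "doc", "Workbook": "xls", "Book": "xls", "PowerPoint Document": "ppt"}
	kind := "unknown"
	for _, n := range names {
		if n == "EncryptedPackage" {
			return "encrypted"
		}
		if k, ok := kinds[n]; ok {
			kind = k
		}
	}
	return kind
}

func main() {
	fmt.Println(classifyCFB([]string{"Root Entry", "Workbook"}))
}
```

In practice you would pass the bytes of a real file to cfbEntryNames and hand the result to classifyCFB.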
Answer 3
Score: 1
There's currently no way to extend http.DetectContentType, as it uses a fixed, unexported slice of "sniffers": https://golang.org/src/net/http/sniff.go (sniffSignatures on line 49 at the time of writing).
I also looked quickly through godoc.org for a better package, but didn't find any that is both extensible and content-oriented, as you require.
My advice would be to build your own package, guided by Go's content sniffer implementation (which follows https://mimesniff.spec.whatwg.org/).
Edit: If you're willing to use cgo and you're on *nix, you could use libmagic bindings such as https://github.com/jteeuwen/magic.
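One way to start on the "build your own package" advice is to check a list of extra signatures first and fall back to the standard sniffer. The sketch below assumes simple prefix signatures only. The signature type, extraSignatures, and detectContentType are hypothetical names, and application/x-ole-storage is used here as a placeholder label for the MS-CFB container rather than an official type.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// signature is a byte prefix and the MIME type it implies (hypothetical type).
type signature struct {
	prefix []byte
	mime   string
}

// extraSignatures are checked in order before the standard library sniffer.
var extraSignatures = []signature{
	// MS-CFB container used by doc, xls and ppt; the label is a placeholder.
	{[]byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}, "application/x-ole-storage"},
}

// detectContentType tries the extra signatures, then http.DetectContentType.
func detectContentType(data []byte) string {
	for _, s := range extraSignatures {
		if bytes.HasPrefix(data, s.prefix) {
			return s.mime
		}
	}
	return http.DetectContentType(data)
}

func main() {
	fmt.Println(detectContentType([]byte("%PDF-1.7")))
	fmt.Println(detectContentType([]byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}))
}
```

Types the standard sniffer already knows keep their usual result, and new formats can be added by appending to the slice.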
Answer 4
Score: 1
I found mimemagic, and I think it's better than magicmime because it doesn't use cgo. However, magicmime is better at distinguishing application/zip from Office file types.