How to detect additional MIME types in Golang

Question

The net/http package has an http.DetectContentType([]byte) function, but it supports only a limited number of types. How can I add support for docx, doc, xls, xlsx, ppt, pps, odt, ods, and odp files, detecting them by content rather than by extension?

As far as I know, this is problematic because docx/xlsx/pptx/odp/odt files have the same signature as a zip file (50 4B 03 04).
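To make the problem concrete, here is a minimal sketch that feeds only the shared signature to http.DetectContentType. The helper name sniff is made up for this illustration; the four bytes are the ZIP signature quoted in the question, which is also how a .docx file begins.

```go
package main

import (
	"fmt"
	"net/http"
)

// sniff is a hypothetical thin wrapper around the standard library sniffer.
func sniff(header []byte) string {
	return http.DetectContentType(header)
}

func main() {
	// 50 4B 03 04: the signature shared by zip, docx, xlsx, pptx, odt, ...
	header := []byte("PK\x03\x04")
	// The sniffer can only report the container format, not the Office type.
	fmt.Println(sniff(header))
}
```

Because every ZIP-based Office format starts with these bytes, the standard sniffer alone cannot tell them apart.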

Answer 1

Score: 7

Disclaimer: I'm the author of mimetype.

For anyone with the same problem three years later: the packages currently available for content-based MIME type detection are the following:

  • filetype

    • pure Go, no C bindings
    • can be extended to detect new MIME types
    • has issues with files that match more than one MIME type (e.g. xlsx and docx also matching as zip): it stores matching functions in a map, so the order of traversal is not guaranteed
    • limited number of detected MIME types
  • magicmime

    • requires libmagic-dev to be installed
    • of the three, it detects the highest number of MIME types
    • can be extended, though with more difficulty (see man magic)
    • libmagic is not thread safe
  • mimetype

    • pure Go, no C bindings
    • detects more MIME types than filetype
    • is thread safe
    • can be extended
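
The map-ordering issue mentioned for filetype can be illustrated with a small sketch. It is not the code of any of the three packages: the matcher type, the detect function, and the "entry name appears near the start of the archive" heuristic are all assumptions made for this example. The point is only that a slice is walked in a fixed order, so specific matchers can be placed ahead of the generic zip matcher.

```go
package main

import (
	"bytes"
	"fmt"
)

// matcher pairs a MIME type with a predicate over the leading bytes of a file.
// Hypothetical type, for illustration only.
type matcher struct {
	mime  string
	match func(header []byte) bool
}

var zipMagic = []byte("PK\x03\x04")

// zipWithEntry is a rough heuristic: the data starts with the ZIP signature
// and the given entry-name prefix appears somewhere in the bytes provided.
func zipWithEntry(prefix string) func([]byte) bool {
	return func(h []byte) bool {
		return bytes.HasPrefix(h, zipMagic) && bytes.Contains(h, []byte(prefix))
	}
}

// matchers is a slice, so it is always traversed in this order:
// the specific Office types first, the generic zip container last.
var matchers = []matcher{
	{"application/vnd.openxmlformats-officedocument.wordprocessingml.document", zipWithEntry("word/")},
	{"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", zipWithEntry("xl/")},
	{"application/vnd.openxmlformats-officedocument.presentationml.presentation", zipWithEntry("ppt/")},
	{"application/zip", func(h []byte) bool { return bytes.HasPrefix(h, zipMagic) }},
}

// detect returns the MIME type of the first matcher that accepts header.
func detect(header []byte) string {
	for _, m := range matchers {
		if m.match(header) {
			return m.mime
		}
	}
	return "application/octet-stream"
}

func main() {
	docxLike := append([]byte("PK\x03\x04"), []byte("....word/document.xml")...)
	fmt.Println(detect(docxLike))
	fmt.Println(detect([]byte("PK\x03\x04plain archive")))
}
```

With a map instead of a slice, the zip matcher could be visited first on some runs, which is the kind of nondeterminism described above.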

Answer 2

Score: 2

Files with an x at the end of the extension are relatively easy to detect. Unzip the file and read _rels/.rels. It contains the path to the main part of the document, identified by the relationship type http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument. Check the name of that target: it is document.xml for docx, workbook.xml for xlsx, and presentation.xml for pptx.

More information can be found in ECMA-376.
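
The steps described for the x-suffixed formats can be sketched with the standard library alone. This is an illustration under assumptions, not a complete detector: the names mainPartName, ooxmlKind, and sampleOOXML are invented here, the sample archive contains only the relationships file, and variants such as macro-enabled documents or templates are not handled.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/xml"
	"fmt"
	"path"
)

const officeDocRel = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"

// relationships mirrors the parts of _rels/.rels that this sketch needs.
type relationships struct {
	Items []struct {
		Type   string `xml:"Type,attr"`
		Target string `xml:"Target,attr"`
	} `xml:"Relationship"`
}

// mainPartName returns the base name of the officeDocument relationship
// target, or "" if the archive has no such relationship.
func mainPartName(data []byte) (string, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return "", err
	}
	for _, f := range zr.File {
		if f.Name != "_rels/.rels" {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return "", err
		}
		defer rc.Close()
		var rels relationships
		if err := xml.NewDecoder(rc).Decode(&rels); err != nil {
			return "", err
		}
		for _, r := range rels.Items {
			if r.Type == officeDocRel {
				return path.Base(r.Target), nil
			}
		}
	}
	return "", nil
}

// ooxmlKind maps the main part name to a file extension, "" if unrecognized.
func ooxmlKind(data []byte) string {
	name, err := mainPartName(data)
	if err != nil {
		return ""
	}
	switch name {
	case "document.xml":
		return "docx"
	case "workbook.xml":
		return "xlsx"
	case "presentation.xml":
		return "pptx"
	}
	return ""
}

// sampleOOXML builds a tiny in-memory archive holding only _rels/.rels.
func sampleOOXML(target string) []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create("_rels/.rels")
	if err != nil {
		panic(err)
	}
	fmt.Fprintf(w, `<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">`+
		`<Relationship Id="rId1" Type="%s" Target="%s"/></Relationships>`, officeDocRel, target)
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	fmt.Println(ooxmlKind(sampleOOXML("word/document.xml")))
	fmt.Println(ooxmlKind(sampleOOXML("xl/workbook.xml")))
}
```

Note that this needs the whole archive (the ZIP directory sits at the end of the file), unlike signature sniffing, which looks only at the first bytes.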

Binary formats are harder to detect. Basically, you need to read the MS-CFB file system and check for these entries:

  • WordDocument for doc
  • Workbook or Book for xls
  • PowerPoint Document for ppt
  • EncryptedPackage if the file is encrypted
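
Properly reading the MS-CFB directory takes more code than fits here, so the following is only a rough heuristic, not the approach the answer describes in full. It assumes that entry names are stored as UTF-16LE strings in the file, checks for the compound-file signature, and then scans the raw bytes for the names listed above. The function names, the returned labels, and the synthetic sample data are all made up for this sketch, and a byte scan like this can give false positives (for example, a Word file with an embedded spreadsheet).

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// cfbMagic is the signature at the start of a compound file.
var cfbMagic = []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}

// cfbKinds is checked in order; the labels on the right are illustrative.
var cfbKinds = []struct{ entry, kind string }{
	{"WordDocument", "doc"},
	{"Workbook", "xls"},
	{"Book", "xls"},
	{"PowerPoint Document", "ppt"},
	{"EncryptedPackage", "encrypted"},
}

// utf16le encodes s as UTF-16 little-endian bytes.
func utf16le(s string) []byte {
	units := utf16.Encode([]rune(s))
	out := make([]byte, 2*len(units))
	for i, u := range units {
		binary.LittleEndian.PutUint16(out[2*i:], u)
	}
	return out
}

// guessCFBKind returns "" when data is not a compound file, "unknown" when
// no known entry name is found, and otherwise the label of the first match.
func guessCFBKind(data []byte) string {
	if !bytes.HasPrefix(data, cfbMagic) {
		return ""
	}
	for _, k := range cfbKinds {
		if bytes.Contains(data, utf16le(k.entry)) {
			return k.kind
		}
	}
	return "unknown"
}

// sampleCFB fabricates bytes that merely look like a compound file:
// the signature, zero padding, and one UTF-16LE entry name.
func sampleCFB(entry string) []byte {
	data := append([]byte{}, cfbMagic...)
	data = append(data, make([]byte, 504)...)
	return append(data, utf16le(entry)...)
}

func main() {
	fmt.Println(guessCFBKind(sampleCFB("WordDocument")))
	fmt.Println(guessCFBKind(sampleCFB("PowerPoint Document")))
}
```

A robust implementation would walk the directory sectors as MS-CFB specifies and compare entry names exactly instead of scanning the whole file.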

Answer 3

Score: 1

There's currently no way to extend http.DetectContentType as it uses a fixed, unexported slice of "sniffers": https://golang.org/src/net/http/sniff.go (sniffSignatures on line 49 at the time of writing).

Also, I looked quickly through godoc.org in search of a better package but didn't find any that is extensible and content-oriented as you require.

My advice would be: build your own package, guided by Go's content sniffer implementation (which follows https://mimesniff.spec.whatwg.org/).
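
As a starting point for such a package, one possible shape is a wrapper that tries its own signature table first and falls back to http.DetectContentType. This is a sketch under assumptions: the signature type, the detectContentType function, and the choice of "application/x-ole-storage" as the label for compound files are illustrative, not part of net/http.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// signature is a hypothetical prefix-based rule, loosely modeled on the idea
// of the fixed signature list that net/http keeps unexported.
type signature struct {
	magic []byte
	mime  string
}

// extraSignatures are checked before the standard sniffer.
// The MIME label for compound files is an assumption made for this example.
var extraSignatures = []signature{
	{[]byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}, "application/x-ole-storage"},
}

// detectContentType consults the extra table, then net/http.
func detectContentType(data []byte) string {
	for _, s := range extraSignatures {
		if bytes.HasPrefix(data, s.magic) {
			return s.mime
		}
	}
	return http.DetectContentType(data)
}

func main() {
	cfb := []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1, 0x00}
	fmt.Println(detectContentType(cfb))
	fmt.Println(detectContentType([]byte("\x89PNG\x0D\x0A\x1A\x0A")))
}
```

Keeping the extra rules in a slice preserves a deterministic order, and the fallback means existing behavior is unchanged for every type the standard sniffer already knows.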

Edit: If you're willing to use cgo and you're on *nix, you could use libmagic bindings such as https://github.com/jteeuwen/magic.

Answer 4

Score: 1

I found mimemagic, which I find preferable to magicmime since it doesn't use cgo. But magicmime is better at differentiating between application/zip and office file types.

huangapple
  • Posted on April 24, 2015, 11:24:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/29838185.html