How can I detect a file's encoding?

Question

I'm trying to figure out the encoding of a file on Windows using Go. After some research, I found many recommendations for Mozilla's Charset Detectors (chardet), but they're hard to compile, and I haven't had any luck.

I've also found libguess, which seems to be widely used on Linux, but I can't make it work on Windows.

What's the best way to go about this? Is there a de-facto standard library to use with Go on Windows?

Answer 1

Score: 2

You can use the Python package chardet.

Answer 2

Score: 0

You might be interested in Enca (Extremely Naive Charset Analyzer). I guess you could try to read the file using every candidate encoding and compute how far each attempt is from a "standard" character frequency distribution for the language. Enca requires some language info, but I'm not sure whether it uses this approach. (It's just an idea; it might be horribly misguided.)
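
To make the frequency idea concrete, here is a minimal Go sketch using only the standard library. It is not how Enca works, just an illustration of the approach described above. Everything in it is an assumption made for the example: the names `guessEncoding`, `distance`, and `encodeUTF16LE`, the three candidate encodings, the rough English frequency weights, and the choice of L1 distance.

```go
package main

import (
	"fmt"
	"math"
	"unicode/utf16"
)

// weights holds rough relative frequencies for English text: a-z, then
// space, then "anything else". The values are approximate and illustrative.
var weights = [28]float64{
	8.2, 1.5, 2.8, 4.3, 12.7, 2.2, 2.0, 6.1, 7.0, 0.15, 0.8, 4.0, 2.4, // a-m
	6.7, 7.5, 1.9, 0.1, 6.0, 6.3, 9.1, 2.8, 1.0, 2.4, 0.15, 2.0, 0.07, // n-z
	22.5, 2.5, // space, anything else
}

// decodeLatin1 maps every byte to the code point with the same value.
func decodeLatin1(b []byte) []rune {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return runes
}

// decodeUTF16LE reads little-endian 16-bit units; an odd trailing byte is dropped.
func decodeUTF16LE(b []byte) []rune {
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		units = append(units, uint16(b[i])|uint16(b[i+1])<<8)
	}
	return utf16.Decode(units)
}

// candidates lists the encodings to try; earlier entries win ties.
var candidates = []struct {
	name   string
	decode func([]byte) []rune
}{
	{"UTF-8", func(b []byte) []rune { return []rune(string(b)) }},
	{"ISO-8859-1", decodeLatin1},
	{"UTF-16LE", decodeUTF16LE},
}

// distance returns the L1 distance between the observed bucket frequencies
// and the normalized reference weights.
func distance(runes []rune) float64 {
	if len(runes) == 0 {
		return math.Inf(1)
	}
	var counts [28]float64
	for _, r := range runes {
		switch {
		case r >= 'a' && r <= 'z':
			counts[r-'a']++
		case r >= 'A' && r <= 'Z':
			counts[r-'A']++
		case r == ' ':
			counts[26]++
		default:
			counts[27]++
		}
	}
	total := 0.0
	for _, w := range weights {
		total += w
	}
	d := 0.0
	for i, c := range counts {
		d += math.Abs(c/float64(len(runes)) - weights[i]/total)
	}
	return d
}

// guessEncoding decodes data with every candidate and keeps the closest fit.
func guessEncoding(data []byte) string {
	best, bestDist := "", math.Inf(1)
	for _, c := range candidates {
		if d := distance(c.decode(data)); d < bestDist {
			best, bestDist = c.name, d
		}
	}
	return best
}

// encodeUTF16LE builds UTF-16LE input for the demo below.
func encodeUTF16LE(s string) []byte {
	var out []byte
	for _, u := range utf16.Encode([]rune(s)) {
		out = append(out, byte(u), byte(u>>8))
	}
	return out
}

func main() {
	sample := "detecting the encoding of a text file is easier when the text is long and written in one language"
	fmt.Println(guessEncoding(encodeUTF16LE(sample))) // prints UTF-16LE
	fmt.Println(guessEncoding([]byte(sample)))        // prints UTF-8
}
```

Note the limits: for pure ASCII input, UTF-8 and ISO-8859-1 decode to the same text, so the sketch cannot tell them apart and simply returns the first candidate. A real detector would need per-language tables and many more encodings.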

Answer 3

Score: 0

You can try the github.com/saintfish/chardet library, which detects the encoding of text from its raw bytes.

Here is sample code in Go:

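package main

import (
	"fmt"
	"os"

	"github.com/saintfish/chardet"
)

// check panics on any error; good enough for a short example.
func check(e error) {
	if e != nil {
		panic(e)
	}
}

func main() {
	// Read the whole file into memory as raw bytes.
	dat, err := os.ReadFile("/Users/yourname/Downloads/456")
	check(err)

	// DetectBest returns the most likely charset for the given bytes.
	detector := chardet.NewTextDetector()
	result, err := detector.DetectBest(dat)
	check(err) // report a failed detection instead of silently printing nothing
	fmt.Printf("Detected charset is %s\n", result.Charset)
}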

The output will look like this:

> Detected charset is ISO-8859-1
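
Since detection works on raw bytes anyway, a couple of cheap checks can settle common cases before a statistical detector runs. The following is a minimal sketch using only the Go standard library. `sniffEncoding` is a hypothetical helper name and not part of saintfish/chardet. It only looks for UTF-8 and UTF-16 byte order marks and for valid UTF-8, and it ignores UTF-32.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// sniffEncoding (hypothetical helper) returns an encoding name when the
// bytes start with a known byte order mark or form valid UTF-8. It returns
// "" when a statistical detector is still needed. UTF-32 is not handled.
func sniffEncoding(data []byte) string {
	switch {
	case bytes.HasPrefix(data, []byte{0xEF, 0xBB, 0xBF}):
		return "UTF-8"
	case bytes.HasPrefix(data, []byte{0xFF, 0xFE}):
		return "UTF-16LE"
	case bytes.HasPrefix(data, []byte{0xFE, 0xFF}):
		return "UTF-16BE"
	case utf8.Valid(data):
		return "UTF-8" // also covers plain ASCII
	}
	return ""
}

func main() {
	fmt.Printf("%q\n", sniffEncoding([]byte{0xFF, 0xFE, 'h', 0}))        // "UTF-16LE"
	fmt.Printf("%q\n", sniffEncoding([]byte("héllo")))                   // "UTF-8"
	fmt.Printf("%q\n", sniffEncoding([]byte{'h', 0xE9, 'l', 'l', 'o'})) // ""
}
```

When the helper returns an empty string, as it does for the ISO-8859-1 bytes in the last call, you would fall back to a detector such as the one in the sample above.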

Posted by huangapple on January 12, 2012 at 23:03:32