Convert any encoding to UTF-8 in Go

Question

I'm downloading messages via IMAP and storing the parsed messages in MongoDB. The problem is that MongoDB only supports UTF-8, while the messages arrive in various encodings. How can I convert each string to UTF-8, whatever its original encoding?

I know I could store the data as binary, but I need normal text because I have to search for phrases in the database. Or is it possible to search for normal text within binary data?

Answer 1

Score: 11

I'm using the go-charset project to do this: https://code.google.com/p/go-charset/

It's pretty straightforward: you create a reader for the source charset, and it translates to UTF-8 automatically as you read. Here is the example from the library:

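// go-charset's NewReader takes the charset name first, then the source reader.
r, err := charset.NewReader("latin1", strings.NewReader("\xa35 for Pepp\xe9"))
if err != nil {
    log.Fatal(err)
}
// Reading from r yields the input converted to UTF-8.
result, err := ioutil.ReadAll(r)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("%s\n", result) // outputs: £5 for Peppé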
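
To make concrete what such a reader does in the simplest case, here is a minimal standard-library sketch. It is not go-charset's actual implementation. It assumes strict ISO-8859-1 input, where every byte value equals its Unicode code point, and latin1ToUTF8 is a hypothetical helper name. Other encodings (including windows-1252, which differs in the 0x80-0x9F range) need real mapping tables, which is what the libraries provide.

```go
package main

import (
	"fmt"
	"strings"
)

// latin1ToUTF8 (hypothetical helper) decodes strict ISO-8859-1 bytes into a
// UTF-8 string. Each Latin-1 byte value is also its Unicode code point, so
// converting the byte to a rune and writing it out is the whole conversion.
func latin1ToUTF8(b []byte) string {
	var sb strings.Builder
	sb.Grow(len(b)) // lower bound; bytes >= 0x80 take two bytes in UTF-8
	for _, c := range b {
		sb.WriteRune(rune(c))
	}
	return sb.String()
}

func main() {
	fmt.Println(latin1ToUTF8([]byte("\xa35 for Pepp\xe9"))) // £5 for Peppé
}
```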

Now, in my case I know the charset because it comes from web pages and I read the headers/meta tags. If you need to detect the charset automatically by heuristics, you'll need another library for that, such as this one: https://github.com/saintfish/chardet

I haven't used it but it also looks pretty simple to use:

// some_text is the raw message body as a []byte.
detector := chardet.NewTextDetector()
result, err := detector.DetectBest(some_text)
if err == nil {
	fmt.Printf(
		"Detected charset is %s, language is %s\n",
		result.Charset,
		result.Language)
}
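
Before reaching for a statistical detector, a cheap pre-check can settle the easy cases. The sketch below uses only the standard library. sniffEncoding is a hypothetical helper, and the names it returns are just labels chosen for this example. It recognizes byte-order marks and input that is already valid UTF-8, and returns an empty string when real detection (or a declared charset) is still needed.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// sniffEncoding (hypothetical helper) returns an encoding label when a
// byte-order mark or a UTF-8 validity check gives a cheap answer, and ""
// when the caller still has to detect or look up the charset.
// Simplification: UTF-32 byte-order marks are not distinguished here.
func sniffEncoding(b []byte) string {
	switch {
	case bytes.HasPrefix(b, []byte{0xEF, 0xBB, 0xBF}):
		return "utf-8" // UTF-8 BOM
	case bytes.HasPrefix(b, []byte{0xFF, 0xFE}):
		return "utf-16le"
	case bytes.HasPrefix(b, []byte{0xFE, 0xFF}):
		return "utf-16be"
	case utf8.Valid(b):
		return "utf-8" // also covers pure ASCII
	}
	return ""
}

func main() {
	fmt.Printf("%q\n", sniffEncoding([]byte("plain ASCII")))         // "utf-8"
	fmt.Printf("%q\n", sniffEncoding([]byte("\xa35 for Pepp\xe9"))) // ""
}
```

Only the inputs that come back as "" need to go through a detector such as chardet.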

Answer 2

Score: 6

In 2020 I found that https://pkg.go.dev/mod/golang.org/x/text worked well for me.

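package main

import (
	"bytes"
	"fmt"
	"io"
	"log"

	"golang.org/x/text/encoding/ianaindex"
	"golang.org/x/text/transform"
)

func main() {
	text := "\xa35 for Pepp\xe9"
	charset := "latin1"
	// Look up the encoding by its IANA/MIME charset name.
	e, err := ianaindex.MIME.Encoding(charset)
	if err != nil {
		log.Fatal(err)
	}
	// Guard against a nil encoding, which the lookup may return for names
	// it recognizes but does not support.
	if e == nil {
		log.Fatalf("unsupported charset: %s", charset)
	}
	// Wrap the source in a reader that decodes to UTF-8 on the fly.
	r := transform.NewReader(bytes.NewBufferString(text), e.NewDecoder())
	result, err := io.ReadAll(r) // io.ReadAll replaces the deprecated ioutil.ReadAll
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s\n", result) // outputs: £5 for Peppé
}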

https://play.golang.org/p/Hl7r146UwhT
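
Whichever decoder you use, it may be worth checking the text right before it goes into MongoDB, for example for messages whose charset was missing or unknown and which therefore skipped conversion. A minimal standard-library sketch, where ensureUTF8 is a hypothetical helper: it leaves valid UTF-8 alone and otherwise replaces each run of invalid bytes with U+FFFD, reporting whether anything was changed so the caller can log it.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// ensureUTF8 (hypothetical helper) returns s unchanged if it is valid UTF-8.
// Otherwise it replaces each run of invalid bytes with U+FFFD and reports
// changed=true. This is a lossy last resort, not a charset conversion.
func ensureUTF8(s string) (clean string, changed bool) {
	if utf8.ValidString(s) {
		return s, false
	}
	return strings.ToValidUTF8(s, "\uFFFD"), true
}

func main() {
	clean, changed := ensureUTF8("Pepp\xe9") // Latin-1 byte that was never decoded
	fmt.Printf("%q %v\n", clean, changed)
}
```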

Answer 3

Score: 5

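charset.NewReader in package golang.org/x/net/html/charset can't deal with the gb2312 encoding when it is passed as a bare name (its second argument is a Content-Type value, not an encoding label). charset.NewReaderLabel, which takes the label directly, can: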

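import (
    "bytes"
    "io"

    "golang.org/x/net/html/charset"
)

// convertToUTF8 decodes str from origEncoding (an encoding label such as
// "gb2312") and returns it as a UTF-8 string.
func convertToUTF8(str string, origEncoding string) (string, error) {
    byteReader := bytes.NewReader([]byte(str))
    // NewReaderLabel returns an error if the label is not recognized.
    reader, err := charset.NewReaderLabel(origEncoding, byteReader)
    if err != nil {
        return "", err
    }
    strBytes, err := io.ReadAll(reader)
    if err != nil {
        return "", err
    }
    return string(strBytes), nil
}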
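
For mail messages, the label to pass usually comes from the Content-Type header of each MIME part. Here is a minimal sketch of extracting it with the standard library's mime.ParseMediaType. charsetFromContentType is a hypothetical helper, and the header values are made-up examples.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType (hypothetical helper) extracts the charset
// parameter from a Content-Type header value and lowercases it.
// It returns "" when the header is malformed or declares no charset.
func charsetFromContentType(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

func main() {
	fmt.Println(charsetFromContentType("text/plain; charset=GB2312"))       // gb2312
	fmt.Println(charsetFromContentType(`text/html; charset="ISO-8859-1"`)) // iso-8859-1
}
```

An empty result means the part declares no charset, so you either fall back to a default or run a detector.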

Answer 4

Score: 1

I've found a better package, which uses iconv. Usage is trivial and is described in the documentation. For example:

output, _ := iconv.ConvertString("Hello World!", "windows-1252", "utf-8")

Link to the package: https://github.com/djimenez/iconv-go

huangapple
  • Posted on December 4, 2014, 23:07:55
  • When reposting, please keep the link to this article: https://go.coder-hub.com/27297328.html