Convert any encoding to UTF-8 in Go
Question
I'm downloading messages via IMAP and then adding the parsed messages to MongoDB. The problem is that MongoDB only supports UTF-8, so I want to convert any encoding to UTF-8. The source encodings vary. How can I convert each string to UTF-8?
I know that I can store the data as binary, but I need normal text because I have to search for phrases in the database. Or is it possible to search for normal text in binary data?
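As a starting point, here is a minimal sketch of how one might check which strings need converting at all before they go into the database. It relies only on the standard library's unicode/utf8 package; the helper name needsConversion is made up for this example and is not from the original post.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// needsConversion reports whether s is NOT already valid UTF-8.
// The name is hypothetical, chosen for this illustration.
func needsConversion(s string) bool {
	return !utf8.ValidString(s)
}

func main() {
	// Already UTF-8 (Go source files are UTF-8): no conversion needed.
	fmt.Println(needsConversion("£5 for Peppé")) // false

	// Raw Latin-1 bytes: 0xa3 and 0xe9 are not valid UTF-8 on their own.
	fmt.Println(needsConversion("\xa35 for Pepp\xe9")) // true
}
```

A check like this only says that some conversion is required. It cannot tell which encoding the bytes are in, so the charset still has to come from the message headers or from a detector.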
Answer 1
Score: 11
I'm using the go-charset project to do this: https://code.google.com/p/go-charset/
It's pretty straightforward: you create a reader for a given charset and it translates to UTF-8 automatically. Here is an example from the library:
r, err := charset.NewReader(strings.NewReader("\xa35 for Pepp\xe9"), "latin1")
if err != nil {
    log.Fatal(err)
}
result, err := ioutil.ReadAll(r)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("%s\n", result)  //outputs £5 for Peppé
In my case I know the charset because the text comes from web pages and I read it from the headers or meta tags. If you need to detect the charset automatically by heuristics, you'll need another library for that, such as this one: https://github.com/saintfish/chardet
I haven't used it, but it also looks pretty simple to use:
detector := chardet.NewTextDetector()
result, err := detector.DetectBest(some_text)
if err == nil {
	fmt.Printf(
		"Detected charset is %s, language is %s",
		result.Charset,
		result.Language)
}
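For mail fetched over IMAP, the charset is normally declared in the Content-Type header of each message part, so reading it from the header can work the same way as for web pages. A rough sketch using the standard library's mime package follows; the helper name charsetOf and the sample header value are assumptions made for this example.

```go
package main

import (
	"fmt"
	"log"
	"mime"
	"strings"
)

// charsetOf returns the lower-cased charset parameter of a Content-Type
// header value, or "" if the header does not declare one.
// The function name is hypothetical, chosen for this example.
func charsetOf(contentType string) (string, error) {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return "", err
	}
	return strings.ToLower(params["charset"]), nil
}

func main() {
	cs, err := charsetOf("text/plain; charset=ISO-8859-1")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(cs) // iso-8859-1
}
```

The returned label can then be passed to whichever converter you use. Heuristic detection is only needed when the header is missing or turns out to be wrong.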
Answer 2
Score: 6
In 2020 I found that https://pkg.go.dev/mod/golang.org/x/text worked well for me.
package main
import (
	"bytes"
	"fmt"
	"io/ioutil"
	"log"
	"golang.org/x/text/encoding/ianaindex"
	"golang.org/x/text/transform"
)
func main() {
	text := "\xa35 for Pepp\xe9"
	charset := "latin1"
	// Look up the encoding by its MIME/IANA name.
	e, err := ianaindex.MIME.Encoding(charset)
	if err != nil {
		log.Fatal(err)
	}
	// Wrap the input in a reader that decodes to UTF-8 on the fly.
	r := transform.NewReader(bytes.NewBufferString(text), e.NewDecoder())
	result, err := ioutil.ReadAll(r)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s\n", result) // outputs £5 for Peppé
}
https://play.golang.org/p/Hl7r146UwhT
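To make it clearer what such a decoder does for this particular input, here is a hand-written sketch for the latin1 (ISO-8859-1) case only, using nothing beyond the standard library. It is an illustration and not a substitute for the library: it works because every ISO-8859-1 byte value equals the Unicode code point of the character it stands for, which is not true of other charsets. The function name latin1ToUTF8 is made up for this example.

```go
package main

import "fmt"

// latin1ToUTF8 decodes ISO-8859-1 bytes by hand. Each Latin-1 byte value
// is also the Unicode code point of its character, so widening every byte
// to a rune and re-encoding the runes as a string yields UTF-8.
// The function name is hypothetical, chosen for this illustration.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	fmt.Println(latin1ToUTF8([]byte("\xa35 for Pepp\xe9"))) // £5 for Peppé
}
```

For multi-byte or table-based encodings such as gb2312 or windows-1252 there is no such shortcut, which is why a library is the practical choice.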
Answer 3
Score: 5
charset.NewReader in the golang.org/x/net/html/charset package can't handle the gb2312 encoding, but charset.NewReaderLabel can.
import (
    "bytes"
    "io/ioutil"
    "golang.org/x/net/html/charset"
)
// convrtToUTF8 decodes str from origEncoding (a charset label such as "gb2312") into UTF-8.
func convrtToUTF8(str string, origEncoding string) (string, error) {
    byteReader := bytes.NewReader([]byte(str))
    reader, err := charset.NewReaderLabel(origEncoding, byteReader)
    if err != nil {
        return "", err
    }
    strBytes, err := ioutil.ReadAll(reader)
    if err != nil {
        return "", err
    }
    return string(strBytes), nil
}
Answer 4
Score: 1
I've found a better package, which uses iconv. Usage is trivial and is described in the documentation. For example:
output, _ := iconv.ConvertString("Hello World!", "windows-1252", "utf-8")
Link to the package: https://github.com/djimenez/iconv-go