Convert any encoding to UTF-8 in Go
Question
I'm downloading messages via IMAP and then storing the parsed messages in MongoDB. The problem is that MongoDB supports only UTF-8, and the messages arrive in various encodings. How can I convert each string to UTF-8, whatever its original encoding?
I know I could store the data as binary, but I need normal text because I have to search for phrases in the database. Or is it possible to search for normal text in binary data?
Answer 1
Score: 11
I'm using the go-charset project to do this: https://code.google.com/p/go-charset/
It's pretty straightforward: you create a reader for a given charset and it translates to UTF-8 automatically. An example from the library:
r, err := charset.NewReader(strings.NewReader("\xa35 for Pepp\xe9"), "latin1")
if err != nil {
	log.Fatal(err)
}
result, err := ioutil.ReadAll(r)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("%s\n", result) // outputs: £5 for Peppé
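To make it concrete what such a reader does for this input, here is a minimal standard-library sketch, assuming "latin1" means strict ISO-8859-1, where every byte value equals its Unicode code point. The helper name latin1ToUTF8 is made up for illustration and is not part of go-charset.

```go
package main

import (
	"fmt"
	"strings"
)

// latin1ToUTF8 is a hypothetical helper (not part of go-charset).
// In ISO-8859-1 each byte value is the Unicode code point with the
// same number, so decoding is a byte-to-rune widening.
func latin1ToUTF8(b []byte) string {
	var sb strings.Builder
	sb.Grow(len(b))
	for _, c := range b {
		sb.WriteRune(rune(c)) // WriteRune emits the UTF-8 encoding of the rune
	}
	return sb.String()
}

func main() {
	fmt.Println(latin1ToUTF8([]byte("\xa35 for Pepp\xe9"))) // £5 for Peppé
}
```

This only covers one single-byte encoding; multi-byte encodings need real conversion tables, which is what the library provides.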
Now, in my case I know the charset because it comes from web pages and I read the headers/meta tags. If you need to detect the charset automatically by heuristics, you'll need another library for that, such as this one: https://github.com/saintfish/chardet
I haven't used it but it also looks pretty simple to use:
detector := chardet.NewTextDetector()
result, err := detector.DetectBest(some_text) // some_text is a []byte
if err == nil {
	fmt.Printf(
		"Detected charset is %s, language is %s",
		result.Charset,
		result.Language)
}
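For mail messages the charset is often declared in the Content-Type header, in which case no heuristic is needed. The following is a rough sketch using only the standard library's mime package; the helper name declaredCharset and the "us-ascii" fallback are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// declaredCharset is a hypothetical helper. It returns the lower-cased
// charset parameter of a Content-Type value, or fallback when the header
// cannot be parsed or declares no charset.
func declaredCharset(contentType, fallback string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return fallback
	}
	if cs, ok := params["charset"]; ok && cs != "" {
		return strings.ToLower(cs)
	}
	return fallback
}

func main() {
	fmt.Println(declaredCharset("text/plain; charset=ISO-8859-1", "us-ascii")) // iso-8859-1
	fmt.Println(declaredCharset("text/html", "us-ascii"))                      // us-ascii
}
```

The returned label can then be passed to whichever decoder is in use; detection would only be the fallback when the header is missing or wrong.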
Answer 2
Score: 6
In 2020 I found that https://pkg.go.dev/mod/golang.org/x/text worked well for me.
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"log"

	"golang.org/x/text/encoding/ianaindex"
	"golang.org/x/text/transform"
)

func main() {
	text := "\xa35 for Pepp\xe9"
	charset := "latin1"
	// Look up the encoding by its MIME/IANA name.
	e, err := ianaindex.MIME.Encoding(charset)
	if err != nil {
		log.Fatal(err)
	}
	// Wrap the input in a reader that decodes to UTF-8 on the fly.
	r := transform.NewReader(bytes.NewBufferString(text), e.NewDecoder())
	result, err := ioutil.ReadAll(r)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s\n", result) // outputs: £5 for Peppé
}
https://play.golang.org/p/Hl7r146UwhT
Answer 3
Score: 5
charset.NewReader in package golang.org/x/net/html/charset can't deal with the gb2312 encoding; charset.NewReaderLabel can.
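import (
	"bytes"
	"io/ioutil"

	"golang.org/x/net/html/charset"
)

// convertToUTF8 decodes str from origEncoding (e.g. "gb2312") to UTF-8.
func convertToUTF8(str string, origEncoding string) (string, error) {
	// NewReaderLabel returns an error for unsupported labels.
	reader, err := charset.NewReaderLabel(origEncoding, bytes.NewReader([]byte(str)))
	if err != nil {
		return "", err
	}
	utf8Bytes, err := ioutil.ReadAll(reader)
	if err != nil {
		return "", err
	}
	return string(utf8Bytes), nil
}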
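Whichever decoder is used, it may be worth checking the result before storing it, since the question's constraint is that the database accepts only UTF-8. A minimal standard-library sketch follows; the helper name ensureUTF8 and the choice of U+FFFD as the replacement are assumptions for illustration, not something the answers above prescribe.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// ensureUTF8 is a hypothetical helper. It returns s unchanged and true
// when s is already valid UTF-8; otherwise it replaces each run of
// invalid bytes with U+FFFD and returns false.
func ensureUTF8(s string) (string, bool) {
	if utf8.ValidString(s) {
		return s, true
	}
	return strings.ToValidUTF8(s, "\uFFFD"), false
}

func main() {
	clean, ok := ensureUTF8("Pepp\xe9") // a Latin-1 byte left undecoded
	fmt.Printf("%q %v\n", clean, ok)
}
```

A false result suggests the declared or detected charset was wrong, so the original bytes could be kept aside for a second attempt with a different encoding.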
Answer 4
Score: 1
I've found a better package, which uses iconv. Usage is trivial and is described in the documentation. For example:
output, _ := iconv.ConvertString("Hello World!", "windows-1252", "utf-8")
Link to the package: https://github.com/djimenez/iconv-go