How to convert from an encoding to UTF-8 in Go?

Question

I'm working on a project where I need to convert text from an encoding (for example Windows-1256 Arabic) to UTF-8.

How do I do this in Go?

In Go, you can use the golang.org/x/text/encoding package to convert between text encodings. The following example shows how to convert Windows-1256 encoded text to UTF-8:

package main

import (
	"fmt"
	"log"
	"os"

	"golang.org/x/text/encoding/charmap"
)

func main() {
	// read the Windows-1256 encoded text file
	encodedText, err := os.ReadFile("input.txt")
	if err != nil {
		log.Fatal(err)
	}

	// create a decoder for the Windows-1256 encoding
	decoder := charmap.Windows1256.NewDecoder()

	// convert the encoded text to UTF-8
	utf8Text, err := decoder.Bytes(encodedText)
	if err != nil {
		log.Fatal(err)
	}

	// print the converted UTF-8 text
	fmt.Println(string(utf8Text))
}

In the example above, we first read the Windows-1256 encoded text file with os.ReadFile. We then create a Windows-1256 decoder with charmap.Windows1256.NewDecoder(). Finally, we call the decoder's Bytes method to convert the encoded text to UTF-8 and print the result.

Note that you need to replace input.txt with the path of the file you actually want to convert. You also need the golang.org/x/text module, which you can install with the following command:

go get golang.org/x/text

Answer 1

Score: 18

You can use the encoding packages under golang.org/x/text. Windows-1256 is supported by the package golang.org/x/text/encoding/charmap (in the example below, import this package and use charmap.Windows1256 instead of japanese.ShiftJIS).

Here's a short example that encodes a Japanese UTF-8 string to ShiftJIS and then decodes the ShiftJIS string back to UTF-8. Unfortunately, it didn't run on the playground when this was written, since the playground didn't have the "x" packages.

package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"strings"

	"golang.org/x/text/encoding/japanese"
	"golang.org/x/text/transform"
)

func main() {
	// the string we want to transform
	s := "今日は"
	fmt.Println(s)

	// --- Encoding: convert s from UTF-8 to ShiftJIS
	// declare a bytes.Buffer b and a writer that encodes everything
	// written to it (as UTF-8) into ShiftJIS and stores the result in b
	var b bytes.Buffer
	w := transform.NewWriter(&b, japanese.ShiftJIS.NewEncoder())
	// encode our string
	if _, err := w.Write([]byte(s)); err != nil {
		log.Fatal(err)
	}
	// Close flushes any bytes the encoder is still holding
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
	// print the encoded bytes
	fmt.Printf("%#v\n", b.Bytes())
	encS := b.String()
	fmt.Println(encS)

	// --- Decoding: convert encS from ShiftJIS to UTF-8
	// declare a reader that decodes the string we have just encoded
	r := transform.NewReader(strings.NewReader(encS), japanese.ShiftJIS.NewDecoder())
	// decode our string
	decBytes, err := io.ReadAll(r)
	if err != nil {
		log.Fatal(err)
	}
	decS := string(decBytes)
	fmt.Println(decS)
}

There's a more complete example on the Japanese StackOverflow site. The text is Japanese, but the code should be self-explanatory: https://ja.stackoverflow.com/questions/6120

Answer 2

Score: 1

I checked the docs and came up with a way to convert an array of bytes to (or from) UTF-8.

What I have a hard time with is that, so far, I haven't found an interface that would let me use a locale. Instead, the options seem to be limited to a predefined set of encodings.

In my case, I needed to convert UTF-16 (really I have UCS-2 data, but it should still work) to UTF-8. To do that, I needed to check for the BOM and then do the conversion:

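// requires "errors", "golang.org/x/text/encoding" and
// "golang.org/x/text/encoding/unicode"
// (ucs2ToUTF8 is an illustrative name for the enclosing function)
func ucs2ToUTF8(buf []byte) ([]byte, error) {
    if len(buf) < 2 {
        return nil, errors.New("BOM missing")
    }
    // read the first two bytes as a little-endian 16-bit value;
    // widen to uint16 first, because byte arithmetic would overflow
    bom := uint16(buf[0]) | uint16(buf[1])<<8
    var enc encoding.Encoding
    if bom == 0xFEFF {
        enc = unicode.UTF16(unicode.LittleEndian, unicode.IgnoreBOM)
    } else if bom == 0xFFFE {
        enc = unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
    } else {
        return nil, errors.New("BOM missing")
    }

    e := enc.NewDecoder()

    // convert UCS-2 (LE or BE) to UTF-8, skipping the BOM checked above
    return e.Bytes(buf[2:])
}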

It's unfortunate that I have to use IgnoreBOM, since in my case a BOM should instead be forbidden past the first character. But that's close enough for my situation. These functions were mentioned in a couple of places, but not shown in practice.
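
For comparison, the same BOM check can be sketched with only the standard library (unicode/utf16 and encoding/binary) instead of golang.org/x/text. This is not the approach used in the answer above, just a minimal illustration under stated assumptions: the helper name decodeUTF16WithBOM is hypothetical, the input is assumed to start with a BOM, and an odd-length body is treated as an error.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"log"
	"unicode/utf16"
)

// decodeUTF16WithBOM is a hypothetical helper: it uses the BOM to pick the
// byte order, then converts the remaining UTF-16 code units to a UTF-8 string.
func decodeUTF16WithBOM(buf []byte) (string, error) {
	if len(buf) < 2 {
		return "", errors.New("BOM missing")
	}
	var order binary.ByteOrder
	switch {
	case buf[0] == 0xFF && buf[1] == 0xFE:
		order = binary.LittleEndian
	case buf[0] == 0xFE && buf[1] == 0xFF:
		order = binary.BigEndian
	default:
		return "", errors.New("BOM missing")
	}

	body := buf[2:] // skip the BOM itself
	if len(body)%2 != 0 {
		return "", errors.New("odd number of bytes in UTF-16 data")
	}

	// Assemble 16-bit code units in the detected byte order.
	units := make([]uint16, len(body)/2)
	for i := range units {
		units[i] = order.Uint16(body[2*i:])
	}

	// utf16.Decode combines surrogate pairs; invalid ones become U+FFFD.
	// Converting the resulting []rune to string yields UTF-8.
	return string(utf16.Decode(units)), nil
}

func main() {
	// "hi" in UTF-16LE, preceded by the FF FE byte order mark.
	s, err := decodeUTF16WithBOM([]byte{0xFF, 0xFE, 'h', 0x00, 'i', 0x00})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(s) // prints "hi"
}
```

Because the whole input is decoded in memory, this sketch suits small buffers; for large files a streaming decoder is the better fit.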

Answer 3

Score: 0

I made a tool for myself; maybe you can borrow some ideas from it.

https://github.com/gonejack/transcode

This is the key code:

_, err = io.Copy(
	transform.NewWriter(output, targetEncoding.NewEncoder()),
	transform.NewReader(input, sourceEncoding.NewDecoder()),
)

huangapple
  • Posted on September 11, 2015, 15:58:19
  • When reposting, please keep the link to this article: https://go.coder-hub.com/32518432.html