golang convert iso8859-1 to utf8

Question

I am trying to convert an ISO 8859-1 encoded string to UTF-8.

The following function works with my test data, which contains German umlauts, but I'm not quite sure what source encoding the rune(b) conversion assumes. Does it assume some kind of default encoding, e.g. ISO 8859-1, or is there a way to tell it which encoding to use?

func toUtf8(iso8859_1_buf []byte) string {
   var buf = bytes.NewBuffer(make([]byte, len(iso8859_1_buf)*4))
   for _, b := range(iso8859_1_buf) {
      r := rune(b)
      buf.WriteRune(r)
   }
   return string(buf.Bytes())
}

Answer 1

Score: 20

rune is an alias for int32, and wherever encoding is involved, a rune is assumed to hold a Unicode character value (code point). So the value b in rune(b) should be a Unicode value. For 0x00 - 0xFF, the Unicode code points are identical to Latin-1 (ISO 8859-1), so you don't have to worry about it.

Then you need to encode the runes as UTF-8. That encoding is done simply by converting the []rune to a string.

This is an example of your function without using the bytes package:

func toUtf8(iso8859_1_buf []byte) string {
	buf := make([]rune, len(iso8859_1_buf))
	for i, b := range iso8859_1_buf {
		buf[i] = rune(b)
	}
	return string(buf)
}
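
As a variation on the function above, here is a minimal sketch that writes the runes into a strings.Builder instead of an intermediate []rune. The name latin1ToUTF8 and the sample input are assumptions made for illustration, not part of the original answer. One detail worth knowing if you prefer a bytes.Buffer: bytes.NewBuffer uses the slice you pass as the buffer's initial contents, so a slice created with make([]byte, n) contributes n zero bytes ahead of anything written afterwards.

```go
package main

import (
	"fmt"
	"strings"
)

// latin1ToUTF8 (hypothetical name) converts ISO 8859-1 bytes to a UTF-8 string.
// It relies on the fact that each Latin-1 byte value equals its Unicode code point.
func latin1ToUTF8(latin1 []byte) string {
	var sb strings.Builder
	// Each Latin-1 byte becomes at most two UTF-8 bytes.
	sb.Grow(2 * len(latin1))
	for _, b := range latin1 {
		sb.WriteRune(rune(b)) // WriteRune performs the UTF-8 encoding
	}
	return sb.String()
}

func main() {
	in := []byte{0x47, 0x72, 0xfc, 0xdf, 0x65} // "Grüße" in ISO 8859-1
	out := latin1ToUTF8(in)
	fmt.Println(out)         // Grüße
	fmt.Printf("% x\n", out) // 47 72 c3 bc c3 9f 65
}
```

The builder starts empty and only reserves capacity, so the result contains nothing but the encoded input.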

Answer 2

Score: 2

The effect of

r := rune(expression)

is:

  • Declare variable r with type rune (alias for int32).
  • Initialize variable r with the value of expression.

No (re)encoding is involved, and choosing which encoding to use is possible only by explicitly writing the re-encoding in code. Luckily, in this case no (re)encoding is necessary: Unicode incorporated the codes of ISO 8859-1 in the same way it incorporated ASCII, so each byte value is already its own code point. (If I checked correctly here)
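
To make the two steps visible, here is a small sketch that separates the numeric conversion from the encoding. The helper names toCodePoint and toUTF8 are assumptions chosen for illustration only.

```go
package main

import "fmt"

// toCodePoint (hypothetical name) does exactly what rune(b) does in the
// question: a plain integer conversion from uint8 to int32, with no table lookup.
func toCodePoint(b byte) rune {
	return rune(b)
}

// toUTF8 (hypothetical name) is where the encoding actually happens:
// converting a rune to a string yields its UTF-8 byte sequence.
func toUTF8(r rune) string {
	return string(r)
}

func main() {
	r := toCodePoint(0xe4) // 0xe4 is "ä" in ISO 8859-1
	fmt.Printf("%d %U\n", r, r) // 228 U+00E4

	s := toUTF8(r)
	fmt.Printf("%q % x\n", s, s) // "ä" c3 a4
}
```

This shortcut holds only for true ISO 8859-1 input. Data that is really Windows-1252 assigns different characters to the bytes 0x80 - 0x9F, so those bytes would need an explicit mapping.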

Answer 3

Score: 0

To convert between any of the ISO-8859 variants (and other popular legacy code pages) and UTF-8, use golang.org/x/text/encoding/charmap.

To decode this Latin-1 encoding:

// "riviére": é is Latin-1-encoded as 233 (0xe9); the spelling rivière would use è, 232 (0xe8)
bLatin1 := []byte{114, 105, 118, 105, 233, 114, 101}

The Charmap type has a NewDecoder method that returns an *encoding.Decoder:

dec8859_1 := charmap.ISO8859_1.NewDecoder()

This decoder can decode bytes directly:

bUTF8, _ := dec8859_1.Bytes(bLatin1)

fmt.Printf("% #x\n", bLatin1) // 0x72 0x69 0x76 0x69 0xe9 0x72 0x65
fmt.Printf("% #x\n", bUTF8)   // 0x72 0x69 0x76 0x69 0xc3 0xa9 0x72 0x65

If you have a file with a legacy encoding:

f, _ := os.Create("foo.txt")
f.Write(bLatin1)
f.Write([]byte("\n"))
f.Write([]byte("Seine"))

use the decoder to wrap your file's Reader:

f, _ = os.Open("foo.txt")
rLatin1 := dec8859_1.Reader(f)

and pass the new decoding Reader to a bufio.Scanner:

scanner := bufio.NewScanner(rLatin1)

for i := 1; scanner.Scan(); i++ {
    fmt.Printf("line %d: %s\n", i, scanner.Text())
}
// line 1: riviére
// line 2: Seine
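
If adding the golang.org/x/text dependency is not an option, the same reader-wrapping idea can be approximated with the standard library, for ISO 8859-1 only. The sketch below is not the charmap implementation; latin1Reader, newLatin1Reader, and decodeLines are hypothetical names, and the approach works solely because every Latin-1 byte value equals its Unicode code point.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"unicode/utf8"
)

// latin1Reader (hypothetical helper) wraps a reader that yields ISO 8859-1
// bytes and yields the same text encoded as UTF-8.
type latin1Reader struct {
	src     *bufio.Reader
	buf     [utf8.UTFMax]byte
	pending []byte // encoded bytes not yet handed to the caller
	err     error  // sticky error from src, reported once pending is drained
}

func newLatin1Reader(r io.Reader) *latin1Reader {
	return &latin1Reader{src: bufio.NewReader(r)}
}

func (l *latin1Reader) Read(p []byte) (int, error) {
	n := 0
	for n < len(p) {
		if len(l.pending) == 0 {
			if l.err != nil {
				break
			}
			// Do not wait for more input once there is something to return.
			if n > 0 && l.src.Buffered() == 0 {
				break
			}
			b, err := l.src.ReadByte()
			if err != nil {
				l.err = err
				break
			}
			// The byte value is the code point; EncodeRune emits 1 or 2 bytes.
			k := utf8.EncodeRune(l.buf[:], rune(b))
			l.pending = l.buf[:k]
		}
		c := copy(p[n:], l.pending)
		l.pending = l.pending[c:]
		n += c
	}
	if n == 0 && len(p) > 0 {
		return 0, l.err
	}
	return n, nil
}

// decodeLines (hypothetical helper) scans Latin-1 input and returns UTF-8 lines.
func decodeLines(latin1 []byte) []string {
	var lines []string
	sc := bufio.NewScanner(newLatin1Reader(bytes.NewReader(latin1)))
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	return lines
}

func main() {
	in := []byte("rivi\xe9re\nSeine") // 0xe9 is "é" in ISO 8859-1
	for i, line := range decodeLines(in) {
		fmt.Printf("line %d: %s\n", i+1, line)
	}
}
```

For any other legacy code page, the charmap decoder shown above remains the more appropriate tool, since those encodings need a real mapping table.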

huangapple
  • Posted on 2012-11-22 18:18:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/13510458.html