2012年11月22日 18:18:12go评论96阅读模式

英文:

golang convert iso8859-1 to utf8

问题

我正在尝试将一个ISO 8859-1编码的字符串转换为UTF-8。

以下函数适用于包含德语umlauts的测试数据，但我不太确定rune(b)转换所假设的源编码是什么。它是否假设某种默认编码，例如ISO8859-1，或者是否有任何方法告诉它要使用哪种编码？

func toUtf8(iso8859_1_buf []byte) string {
   var buf = bytes.NewBuffer(make([]byte, len(iso8859_1_buf)*4))
   for _, b := range(iso8859_1_buf) {
      r := rune(b)
      buf.WriteRune(r)
   }
   return string(buf.Bytes())
}

英文:

I am trying to convert an ISO 8859-1 encoded string to UTF-8.

The following function works with my testdata which contains german umlauts, but I'm not quite sure what source encoding the rune(b) cast assumes. Is it assuming some kind of default encoding, e.g. ISO8859-1 or is there any way to tell it what encoding to use?

func toUtf8(iso8859_1_buf []byte) string {
   var buf = bytes.NewBuffer(make([]byte, len(iso8859_1_buf)*4))
   for _, b := range(iso8859_1_buf) {
      r := rune(b)
      buf.WriteRune(r)
   }
   return string(buf.Bytes())
}

答案1

得分: 20

rune 是 int32 的别名，当涉及到编码时，假设一个 rune 具有一个 Unicode 字符值（码点）。所以在 rune(b) 中，b 的值应该是一个 Unicode 值。对于 0x00 - 0xFF，这个值与 Latin-1 是相同的，所以你不需要担心它。

然后你需要将这些 rune 编码为 UTF8。但是这个编码只需要将 []rune 转换为 string 即可。

这是一个不使用 bytes 包的函数示例：

func toUtf8(iso8859_1_buf []byte) string {
    buf := make([]rune, len(iso8859_1_buf))
    for i, b := range iso8859_1_buf {
        buf[i] = rune(b)
    }
    return string(buf)
}

英文:

rune is an alias for int32, and when it comes to encoding, a rune is assumed to have a Unicode character value (code point). So the value b in rune(b) should be a unicode value. For 0x00 - 0xFF this value is identical to Latin-1, so you don't have to worry about it.

Then you need to encode the runes into UTF8. But this encoding is simply done by converting a []rune to string.

This is an example of your function without using the bytes package:

func toUtf8(iso8859_1_buf []byte) string {
	buf := make([]rune, len(iso8859_1_buf))
	for i, b := range iso8859_1_buf {
		buf[i] = rune(b)
	}
	return string(buf)
}

答案2

得分: 2

r := rune(expression)的效果是：

声明类型为rune（int32的别名）的变量r。
用表达式的值初始化变量r。

这不涉及（重新）编码，并且只有在代码中显式编写/处理一些重新编码时才能选择使用哪种编码。幸运的是，在这种情况下不需要（重新）编码，Unicode以与ASCII相似的方式将ISO 8859-1的这些代码合并了进来。（如果我检查正确的话，请参考这里）

英文:

The effect of

r := rune(expression)

is:

Declare variable r with type rune (alias for int32).
Initialize variable r with the value of expresion.

No (re)encoding is involved and saying which one should be optionally used is possible only by explicitly writing/handling some re-encoding in code. Luckily, in this case no (re)encoding is necessary, Unicode incorporated those codes of ISO 8859-1 in a comparable way as ASCII. (If I checked correctly here)

答案3

得分: 0

要在ISO-8859变体（和其他流行的遗留代码页）和UTF-8之间进行转换，请使用golang.org/x/text/encoding/charmap。

要解码此Latin1编码：

// rivi&#232;re, &#232; latin1-encoded as 233 (0xe9)
bLatin1 := []byte{114, 105, 118, 105, 233, 114, 101}

Charmap类型有一个NewDecoder方法，返回一个*encoding.Decoder：

dec8859_1 := charmap.ISO8859_1.NewDecoder()

此解码器可以直接解码字节：

bUTF8, _ := dec8859_1.Bytes(bLatin1)

fmt.Printf("% #x\n", bLatin1) // 0x72 0x69 0x76 0x69 0xe9 0x72 0x65
fmt.Printf("% #x\n", bUTF8)   // 0x72 0x69 0x76 0x69 0xc3 0xa9 0x72 0x65

如果您有一个使用遗留编码的文件：

f, _ := os.Create("foo.txt")
f.Write(bLatin1)
f.Write([]byte("\n"))
f.Write([]byte("Seine"))

使用解码器来包装您的文件的Reader：

f, _ = os.Open("foo.txt")
rLatin1 := dec8859_1.Reader(f)

并传递新的解码器-Reader：

scanner := bufio.NewScanner(rLatin1)

for i := 1; scanner.Scan(); i++ {
    fmt.Printf("line %d: %s\n", i, scanner.Text())
}
// line 1: rivi&#233;re
// line 2: Seine

英文:

To convert between any of the ISO-8859 variants (and other popular legacy code pages) and UTF-8 use golang.org/x/text/encoding/charmap.

To decode this latin1 encoding:

// rivi&#232;re, &#232; latin1-encoded as 233 (0xe9)
bLatin1 := []byte{114, 105, 118, 105, 233, 114, 101}

the Charmap type has a NewDecoder method that returns a *encoding.Decoder:

dec8859_1 := charmap.ISO8859_1.NewDecoder()

This decoder can decode bytes directly:

bUTF8, _ := dec8859_1.Bytes(bLatin1)

fmt.Printf(&quot;% #x\n&quot;, bLatin1) // 0x72 0x69 0x76 0x69 0xe9 0x72 0x65
fmt.Printf(&quot;% #x\n&quot;, bUTF8)   // 0x72 0x69 0x76 0x69 0xc3 0xa9 0x72 0x65

If you have file with a legacy encoding:

f, _ := os.Create(&quot;foo.txt&quot;)
f.Write(bLatin1)
f.Write([]byte(&quot;\n&quot;))
f.Write([]byte(&quot;Seine&quot;))

use the decoder to wrap your file's Reader:

f, _ = os.Open(&quot;foo.txt&quot;)
rLatin1 := dec8859_1.Reader(f)

and pass the new decoder-Reader:

scanner := bufio.NewScanner(rLatin1)

for i := 1; scanner.Scan(); i++ {
    fmt.Printf(&quot;line %d: %s\n&quot;, i, scanner.Text())
}
// line 1: rivi&#233;re
// line 2: Seine

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

golang将iso8859-1转换为utf8

问题

答案1

答案2

答案3

golang: How can i access json decoded field?

Golang包依赖问题

json.Unmarshal工作不正常

How can i implement concurrency in C like we do in go?

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论