How to detect when bytes can't be converted to string in Go?

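Question

There are invalid byte sequences that can't be converted to Unicode strings. How do I detect that when converting []byte to string in Go?

You can use the utf8.Valid function to check whether a byte sequence is valid UTF-8. It takes a []byte and returns a bool that reports whether the bytes are valid.

Here is an example: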

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
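	// 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
	bytes := []byte{0xC3, 0x28} // invalid byte sequence

	if utf8.Valid(bytes) {
		str := string(bytes)
		fmt.Println("conversion succeeded:", str)
	} else {
		fmt.Println("invalid byte sequence")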
	}
}

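The example above defines a []byte holding an invalid byte sequence and checks it with utf8.Valid. If the bytes are valid, it converts them to a string and prints the result; otherwise it prints a message saying so.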
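
To see how utf8.Valid treats different kinds of input, here is a small sketch that runs it over a few hand-picked byte slices. The wrapper name isValid and the sample inputs are made up for illustration; only utf8.Valid comes from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isValid reports whether b is entirely valid UTF-8. It is a thin,
// hypothetical wrapper around utf8.Valid, named only for this example.
func isValid(b []byte) bool {
	return utf8.Valid(b)
}

func main() {
	inputs := [][]byte{
		[]byte("héllo"),    // well-formed text
		{},                 // empty input counts as valid
		{0xC3, 0x28},       // lead byte followed by a non-continuation byte
		{0xE4, 0xB8},       // three-byte sequence cut off after two bytes
		{0xED, 0xA0, 0x80}, // encoded surrogate half, not allowed in UTF-8
	}
	for _, in := range inputs {
		fmt.Printf("% x -> %v\n", in, isValid(in))
	}
}
```

If the data is already a string, utf8.ValidString does the same check without converting to []byte first.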

Answer 1

Score: 27

You can, as Tim Cooper noted, test UTF-8 validity with utf8.Valid.

But! You might be thinking that converting non-UTF-8 bytes to a Go string is impossible. In fact, "In Go, a string is in effect a read-only slice of bytes"; it can contain bytes that aren't valid UTF-8 which you can print, access via indexing, pass to WriteString methods, or even round-trip back to a []byte (to Write, say).
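
As a rough illustration of that point, the sketch below stores invalid UTF-8 in a string, indexes into it, and converts it back to a []byte. The helper name roundTrip and the sample bytes are assumptions made for this example.

```go
package main

import (
	"bytes"
	"fmt"
)

// roundTrip converts b to a string and back to a []byte.
// Neither conversion validates or alters the bytes.
func roundTrip(b []byte) []byte {
	s := string(b)
	return []byte(s)
}

func main() {
	raw := []byte{'h', 'i', 0xff, 0xfe} // the last two bytes are not valid UTF-8
	s := string(raw)

	fmt.Println(len(s))       // len counts bytes, not characters
	fmt.Printf("%#x\n", s[2]) // indexing yields the raw byte
	fmt.Println(bytes.Equal(raw, roundTrip(raw)))
}
```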

There are two places in the language that Go does do UTF-8 decoding of strings for you.

  • when you write for i, r := range s, each r is a Unicode code point, as a value of type rune
  • when you do the conversion []rune(s), Go decodes the whole string to runes

(Note that rune is an alias for int32, not a completely different type.)
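
A short sketch of the range decoding and the alias, under the assumption that a helper named codePoints (hypothetical) is a fair way to show both: it appends each decoded rune to an []int32 with no conversion, which compiles only because rune and int32 are the same type.

```go
package main

import "fmt"

// codePoints decodes s with a range loop. Because rune is an alias
// for int32, runes can be appended to an []int32 without conversion.
func codePoints(s string) []int32 {
	var out []int32
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	s := "a\xff\xfeb" // two invalid bytes between two ASCII letters
	for i, r := range s {
		// i is the byte offset at which the rune starts.
		fmt.Printf("offset %d: %U\n", i, r)
	}
	fmt.Println(codePoints(s))
}
```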

In both these instances invalid UTF-8 is replaced with U+FFFD, the replacement character reserved for uses like this; each byte that can't be decoded produces one U+FFFD. More is in the spec sections on for statements and conversions between strings and other types. These conversions never panic, so you only need to actively check for UTF-8 validity if it's relevant to your application, for example if you can't accept the U+FFFD replacement and need to return an error on mis-encoded input.
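
For the case where the replacement is not acceptable, one possible shape is a conversion helper that checks validity first. This is only a sketch: the names toStringStrict and errInvalidUTF8 are hypothetical, not part of any library.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 is a hypothetical sentinel error for this example.
var errInvalidUTF8 = errors.New("input is not valid UTF-8")

// toStringStrict converts b to a string, rejecting invalid UTF-8
// instead of letting later decoding substitute U+FFFD.
func toStringStrict(b []byte) (string, error) {
	if !utf8.Valid(b) {
		return "", errInvalidUTF8
	}
	return string(b), nil
}

func main() {
	for _, in := range [][]byte{[]byte("ok"), {0xff}} {
		s, err := toStringStrict(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("string:", s)
	}
}
```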

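Since that behavior's baked into the language, you can expect it from libraries, too. U+FFFD is available as the constant utf8.RuneError, and the decoding functions in unicode/utf8, such as utf8.DecodeRune, return it when they hit invalid input.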
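
As a sketch of how that return value can be used, the hypothetical helper firstInvalid below walks a []byte with utf8.DecodeRune and reports where decoding first fails. It relies on the documented convention that a failed decode returns utf8.RuneError with a size of 1, whereas a genuine U+FFFD in the input decodes with a size of 3.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalid returns the byte offset of the first invalid UTF-8
// sequence in b, or -1 if b is entirely valid.
func firstInvalid(b []byte) int {
	for i := 0; i < len(b); {
		r, size := utf8.DecodeRune(b[i:])
		// RuneError with size 1 signals a decoding failure; a literal
		// U+FFFD in the input decodes with size 3 and is not an error.
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	fmt.Println(firstInvalid([]byte("ab\xffcd")))  // invalid byte at offset 2
	fmt.Println(firstInvalid([]byte("a\uFFFDb"))) // a real U+FFFD is valid input
}
```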

Here's a sample program showing what Go does with a []byte holding invalid UTF-8:

package main

import "fmt"

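func main() {
	a := []byte{0xff} // 0xff can never appear in valid UTF-8
	s := string(a)    // the conversion copies the bytes unchanged, with no validation
	fmt.Println(s)
	for _, r := range s { // range decodes UTF-8; the invalid byte yields U+FFFD
		fmt.Println(r)
	}
	rs := []rune(s) // the conversion to []rune decodes the same way
	fmt.Println(rs)
}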

The first line is the raw byte 0xff, so how it renders depends on the environment; in the Playground the output looks like:

�
65533
[65533]
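
If the environment-dependent rendering gets in the way while debugging, one option is to print the string escaped rather than raw. A minimal sketch, assuming a hypothetical helper named quote; the %q and % x verbs are standard fmt verbs.

```go
package main

import "fmt"

// quote renders s with Go escape sequences, so invalid bytes
// appear as \x.. escapes instead of raw bytes.
func quote(s string) string {
	return fmt.Sprintf("%q", s)
}

func main() {
	s := string([]byte{'a', 0xff, 'b'})
	fmt.Println(quote(s))  // "a\xffb"
	fmt.Printf("% x\n", s) // 61 ff 62
}
```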

huangapple
  • Posted on January 19, 2016, 02:20:55
  • When reposting, please keep the link to this article: https://go.coder-hub.com/34861479.html