How to detect when bytes can't be converted to string in Go?
Question
There are invalid byte sequences that can't be converted to Unicode strings. How do I detect that when converting []byte to string in Go?
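You can use the utf8.Valid function to check whether a byte sequence is valid UTF-8. It takes a []byte and returns a bool reporting whether the bytes consist entirely of valid UTF-8.
Here is an example: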
    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
        b := []byte{0xC3, 0x28}
        if utf8.Valid(b) {
            str := string(b)
            fmt.Println("conversion succeeded:", str)
        } else {
            fmt.Println("invalid byte sequence")
        }
    }
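The example above defines a []byte holding an invalid byte sequence and checks it with utf8.Valid. If the bytes are valid, it converts them to a string and prints the result; otherwise it prints a message saying so.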
Answer 1
Score: 27
You can, as Tim Cooper noted, test UTF-8 validity with utf8.Valid.
But! You might be thinking that converting non-UTF-8 bytes to a Go string is impossible. In fact, "In Go, a string is in effect a read-only slice of bytes"; it can contain bytes that aren't valid UTF-8, which you can print, access via indexing, pass to WriteString methods, or even round-trip back to a []byte (to Write, say).
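To make the round trip concrete, here is a small sketch. The helper name roundTrip is invented for this illustration; the rest uses only the standard library.

```go
package main

import (
	"bytes"
	"fmt"
)

// roundTrip converts b to a string and back to a []byte.
// (Hypothetical helper name, used only for this illustration.)
func roundTrip(b []byte) []byte {
	s := string(b)
	return []byte(s)
}

func main() {
	in := []byte{'a', 0xff, 'b'} // 0xff is not valid UTF-8
	s := string(in)

	fmt.Println(len(s))       // length is counted in bytes: 3
	fmt.Printf("%#x\n", s[1]) // indexing yields the raw byte: 0xff

	// The conversion in both directions copies bytes without altering them.
	fmt.Println(bytes.Equal(roundTrip(in), in)) // true
}
```

Neither conversion inspects the bytes, which is why no error can arise at this point.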
There are two places in the language where Go does do UTF-8 decoding of strings for you:
- when you do for i, r := range s, the r is a Unicode code point as a value of type rune
- when you do the conversion []rune(s), Go decodes the whole string to runes.
(Note that rune is an alias for int32, not a completely different type.)
In both these instances invalid UTF-8 is replaced with U+FFFD, the replacement character reserved for uses like this. More is in the spec sections on for statements and conversions between strings and other types. These conversions never crash, so you only need to actively check for UTF-8 validity if it's relevant to your application, like if you can't accept the U+FFFD replacement and need to return an error on mis-encoded input.
Since that behavior's baked into the language, you can expect it from libraries, too. U+FFFD is utf8.RuneError, and it is what the decoding functions in utf8 return for invalid input.
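One wrinkle when relying on utf8.RuneError: a string can legitimately contain U+FFFD, so seeing that rune does not by itself prove the input was malformed. utf8.DecodeRuneInString reports a width of 1 alongside RuneError for a decoding failure, whereas a genuine U+FFFD is three bytes wide. The sketch below uses that to find the first bad byte; firstInvalid is a hypothetical helper written for this example, not a library function.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalid returns the byte offset of the first invalid UTF-8 sequence
// in s, or -1 if there is none. (Hypothetical helper for illustration.)
func firstInvalid(s string) int {
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		// RuneError with width 1 signals a decoding failure; a validly
		// encoded U+FFFD comes back with width 3.
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	fmt.Println(firstInvalid("ok \uFFFD ok")) // genuine U+FFFD: -1
	fmt.Println(firstInvalid("ok \xff ok"))   // stray 0xff byte: 3
}
```

If only a yes/no answer is needed, utf8.ValidString does the same job without reporting a position.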
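Here's a sample program showing what Go does with a []byte holding invalid UTF-8:

    package main

    import "fmt"

    func main() {
        a := []byte{0xff} // 0xff never appears in valid UTF-8
        s := string(a)    // the conversion copies the byte unchanged
        fmt.Println(s)
        for _, r := range s {
            fmt.Println(r) // range decodes; the bad byte becomes U+FFFD
        }
        rs := []rune(s) // converting to []rune decodes the same way
        fmt.Println(rs)
    }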
Output will look different in different environments, but in the Playground it looks like
�
65533
[65533]