Go string appears shorter than its first rune
Question
I was fuzzing my code and the fuzzer found a bug. I have reduced it to the following snippet, but I cannot see what is wrong.
Given the string
    s := string("\xc0")
len(s) returns 1. However, if you loop over the string, the first rune has length 3:
    for _, r := range s {
        fmt.Println("len of rune:", utf8.RuneLen(r)) // Will print 3
    }
My assumptions are:
- len(string) returns the number of bytes in the string
- utf8.RuneLen(r) returns the number of bytes in the rune
I assume I am misunderstanding something, but how can the length of a string be less than the length of one of its runes?
Playground here: https://go.dev/play/p/SH3ZI2IZyrL
Answer 1
Score: 2
The explanation is simple: your input is not a valid UTF-8 encoded string.
    fmt.Println(utf8.ValidString(s))
This prints false.
A for range loop over a string iterates over its runes, but when it encounters an invalid UTF-8 sequence, r is set to the Unicode replacement character 0xFFFD. Spec: For statements:
> For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
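To see the rule quoted from the spec in action, here is a minimal sketch. The step type and the rangeSteps helper are hypothetical names introduced only for this illustration; they record the byte index and rune that each iteration of a range loop produces.

```go
package main

import "fmt"

// step holds what one iteration of a range loop over a string yields.
// (Hypothetical type, used only for this illustration.)
type step struct {
	Index int  // byte index where the iteration started
	Rune  rune // rune value reported by the range loop
}

// rangeSteps is a hypothetical helper that collects every (index, rune)
// pair produced by ranging over s.
func rangeSteps(s string) []step {
	var steps []step
	for i, r := range s {
		steps = append(steps, step{Index: i, Rune: r})
	}
	return steps
}

func main() {
	// "a", then the invalid byte 0xC0, then U+00E9 (2 bytes in UTF-8).
	for _, st := range rangeSteps("a\xc0\u00e9") {
		fmt.Printf("index %d: %U\n", st.Index, st.Rune)
	}
}
```

The invalid byte at index 1 is reported as U+FFFD, and the loop then resumes at index 2, one byte later, even though U+FFFD itself would occupy three bytes if it were encoded.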
This applies to your case: r is 0xFFFD, which takes 3 bytes when encoded as UTF-8 (ef bf bd), so utf8.RuneLen(r) reports 3 even though the string itself holds only the single byte 0xc0.
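The two numbers in the question measure different things, and that can be checked directly. The sketch below uses utf8.DecodeRuneInString, which returns both the decoded rune and the number of input bytes it consumed. The firstRuneSizes helper is a hypothetical name used only for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRuneSizes is a hypothetical helper. It returns how many bytes of s
// the first decoded rune consumed, and how many bytes that rune would need
// if it were encoded as UTF-8.
func firstRuneSizes(s string) (consumed, encoded int) {
	r, size := utf8.DecodeRuneInString(s)
	return size, utf8.RuneLen(r)
}

func main() {
	s := "\xc0" // a single byte that is not valid UTF-8
	consumed, encoded := firstRuneSizes(s)
	fmt.Println("len(s):", len(s))
	fmt.Println("bytes consumed by first rune:", consumed)
	fmt.Println("bytes needed to encode that rune:", encoded)
}
```

For the invalid input, the decoder consumes 1 byte but hands back utf8.RuneError (U+FFFD), whose encoding is 3 bytes long. For valid UTF-8 the two values agree, so comparing them is one way to notice that a rune came from a decoding error rather than from the string's actual bytes.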
If you instead use a valid string holding the rune '\xc0' (U+00C0):
    s = string([]rune{'\xc0'})
Then the output is:
    len of s: 2
    runes in s: 1
    len of rune: 2
    UTF-8 bytes of s: [195 128]
    Hexa UTF-8 bytes of s: c3 80
Try it on the Go Playground.
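The Playground program behind that output is not reproduced here, so the following is only a sketch of code that would print the same kind of report. The describe helper and its exact format strings are assumptions, chosen to match the labels shown in the output.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper that reports the byte length, rune
// count, per-rune encoded length, and raw bytes of s.
func describe(s string) string {
	out := fmt.Sprintf("len of s: %d\n", len(s))
	out += fmt.Sprintf("runes in s: %d\n", utf8.RuneCountInString(s))
	for _, r := range s {
		out += fmt.Sprintf("len of rune: %d\n", utf8.RuneLen(r))
	}
	out += fmt.Sprintf("UTF-8 bytes of s: %v\n", []byte(s))
	out += fmt.Sprintf("Hexa UTF-8 bytes of s: % x\n", s)
	return out
}

func main() {
	// Build the string from a rune, so Go encodes U+00C0 as valid UTF-8.
	s := string([]rune{'\xc0'})
	fmt.Print(describe(s))
}
```

Building the string from a []rune makes Go encode U+00C0 as the two bytes c3 80, whereas the string literal "\xc0" stores the raw byte 0xc0 unchanged.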