Go string appears shorter than its first rune

Question

I was running some fuzzing on my code and it found a bug. I have reduced it down to the following code snippet and I cannot see what is wrong.

Given the string

s := string("\xc0")

len(s) returns 1. However, if you range over the string, the first rune has length 3.

	for _, r := range s {
		fmt.Println("len of rune:", utf8.RuneLen(r)) // Will print 3
	}

My assumptions are:

  • len(string) is returning the number of bytes in the string
  • utf8.RuneLen(r) is returning the number of bytes in the rune

I assume I am misunderstanding something, but how can the length of a string be less than the length of one of its runes?

Playground here: https://go.dev/play/p/SH3ZI2IZyrL

Answer 1

Score: 2

The explanation is simple: your input is not a valid UTF-8 encoded string.

fmt.Println(utf8.ValidString(s))

This outputs: false.

A for range loop over a string iterates over its runes, but when it encounters an invalid UTF-8 sequence, r is set to the Unicode replacement character 0xFFFD. Spec: For statements:

> For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
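
To see the quoted rule in action, here is a minimal sketch. The helper name rangeSteps and the sample string are made up for illustration; they are not part of the question's code.

```go
package main

import "fmt"

// rangeSteps is a hypothetical helper: it records the byte index and the
// rune produced by each iteration of a range loop over s.
func rangeSteps(s string) (indexes []int, runes []rune) {
	for i, r := range s {
		indexes = append(indexes, i)
		runes = append(runes, r)
	}
	return indexes, runes
}

func main() {
	// 'a', then two invalid bytes, then 'é' (2 bytes in UTF-8).
	s := "a\xc0\xffé"
	indexes, runes := rangeSteps(s)
	for k := range indexes {
		fmt.Printf("index %d: %U\n", indexes[k], runes[k])
	}
}
```

Each invalid byte yields U+FFFD and moves the index forward by exactly one byte, so the loop reports indexes 0, 1, 2, 3 with runes U+0061, U+FFFD, U+FFFD, U+00E9.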

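This applies to your case: r is 0xFFFD, which takes 3 bytes when encoded as UTF-8 (ef bf bd). utf8.RuneLen(r) reports the encoded length of that replacement rune, not the number of bytes the loop consumed from s, which is 1.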

If you instead use a valid string holding the rune '\xc0' (U+00C0):

s = string([]rune{'\xc0'})

Then the output is:

len of s: 2
runes in s: 1
len of rune: 2
UTF-8 bytes of s: [195 128]
Hexa UTF-8 bytes of s: c3 80

Try it on the Go Playground.
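
The linked program is not reproduced on this page. As a rough reconstruction (an assumption, not the answer's original code), a helper like the hypothetical report below produces the same kind of output for both the invalid and the valid string:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// report is a hypothetical helper that summarizes a string's validity,
// byte length, rune count, and the encoded length of each rune.
func report(s string) string {
	out := fmt.Sprintf("valid UTF-8: %t\n", utf8.ValidString(s))
	out += fmt.Sprintf("len of s: %d\n", len(s))
	out += fmt.Sprintf("runes in s: %d\n", utf8.RuneCountInString(s))
	for _, r := range s {
		out += fmt.Sprintf("len of rune: %d\n", utf8.RuneLen(r))
	}
	out += fmt.Sprintf("UTF-8 bytes of s: %v\n", []byte(s))
	out += fmt.Sprintf("Hexa UTF-8 bytes of s: % x\n", s)
	return out
}

func main() {
	fmt.Print(report("\xc0")) // the invalid input from the question
	fmt.Println()
	fmt.Print(report(string([]rune{'\xc0'}))) // the valid string from the answer
}
```

For the invalid input, the report shows a 1-byte string whose single rune has an encoded length of 3. For the valid string, every number agrees with the bytes actually stored.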

huangapple
  • Posted on September 7, 2022, 22:10:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/73636971.html