2022-09-07 22:10:53 · go

Go string appears shorter than its first rune

Question

I was running some fuzzing on my code and it found a bug. I have reduced it down to the following code snippet and I cannot see what is wrong.

Given the string

s := string("\xc0")

len(s) returns 1. However, if you loop over the string, the first rune has length 3.

	for _, r := range s {
		fmt.Println("len of rune:", utf8.RuneLen(r)) // Will print 3
	}

My assumptions are:

len(string) is returning the number of bytes in the string
utf8.RuneLen(r) is returning the number of bytes in the rune

I assume I am misunderstanding something, but how can the length of a string be less than the length of one of its runes?

Playground here: https://go.dev/play/p/SH3ZI2IZyrL

Answer 1

Score: 2

The explanation is simple: your input is not a valid UTF-8 encoded string.

fmt.Println(utf8.ValidString(s))

This outputs: false.

A for range loop over a string iterates over its runes, but when it encounters an invalid UTF-8 sequence, it sets r to the Unicode replacement character 0xFFFD. Spec: For statements:

> For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.

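This applies to your case: the single byte \xc0 is not a valid UTF-8 sequence on its own, so r is 0xFFFD, and 0xFFFD takes 3 bytes when encoded as UTF-8. utf8.RuneLen(r) reports the encoded length of the replacement character, not the number of bytes the loop consumed from s.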

If you instead use a valid string holding the rune \xc0:

s = string([]rune{'\xc0'})

Then the output is:

len of s: 2
runes in s: 1
len of rune: 2
UTF-8 bytes of s: [195 128]
Hexa UTF-8 bytes of s: c3 80

Try it on the Go Playground.
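The Playground program behind that output is not reproduced here. The following is a sketch of one that would print lines of the same shape; the helper name describe and its exact layout are assumptions, not the original code.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper for this example. It returns one report
// line per measurement of s, so the result can be printed or compared.
func describe(s string) []string {
	lines := []string{
		fmt.Sprint("len of s: ", len(s)),
		fmt.Sprint("runes in s: ", utf8.RuneCountInString(s)),
	}
	// Ranging over a string yields runes, decoded from UTF-8.
	for _, r := range s {
		lines = append(lines, fmt.Sprint("len of rune: ", utf8.RuneLen(r)))
	}
	return append(lines,
		fmt.Sprint("UTF-8 bytes of s: ", []byte(s)),
		fmt.Sprintf("Hexa UTF-8 bytes of s: % x", s),
	)
}

func main() {
	// Converting a rune slice to a string UTF-8 encodes each rune,
	// so U+00C0 becomes the two bytes c3 80.
	s := string([]rune{'\xc0'})
	for _, line := range describe(s) {
		fmt.Println(line)
	}
}
```

Calling describe("\xc0") on the original invalid string gives a length of 1, a rune count of 1, and a rune length of 3, which is the mismatch from the question.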
