What determines the position of a character when looping through UTF-8 strings?

Question

I am reading the section on for statements in the Effective Go documentation and came across this example:

for pos, char := range "日本\x80語" {
	fmt.Printf("Character %#U, at position: %d\n", char, pos)
}

The output is:

Character U+65E5 '日', at position: 0
Character U+672C '本', at position: 3
Character U+FFFD '�', at position: 6
Character U+8A9E '語', at position: 7

What I don't understand is why the positions are 0, 3, 6, and 7. This tells me that the first and second characters are each 3 bytes long and that the 'replacement rune' (U+FFFD) is 1 byte long, which I accept and understand. However, I thought rune was an int32 and would therefore take 4 bytes each, not 3.

Why do the positions in a range loop differ from the amount of memory each value should consume?

Answer 1

Score: 6

String values in Go are immutable sequences of bytes (conceptually a read-only []byte), where the bytes are normally the UTF-8 encoding of the string's runes. UTF-8 is a variable-length encoding: different Unicode code points may be encoded using different numbers of bytes. For example, values in the range 0..127 are encoded as a single byte (whose value is the code point itself), while values greater than 127 use 2 to 4 bytes. The unicode/utf8 package contains UTF-8 related utility functions and constants; for example, utf8.UTFMax is the maximum number of bytes a valid Unicode code point may occupy in UTF-8 encoding (which is 4).
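To make the variable-length point concrete, here is a minimal sketch that prints how many bytes UTF-8 needs for a few code points. The helper name encodedLen is made up for this illustration; utf8.RuneLen, utf8.RuneCountInString and utf8.UTFMax are the standard library names.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen (hypothetical helper) reports how many bytes UTF-8 needs for r.
// utf8.RuneLen returns -1 if r cannot be encoded as UTF-8.
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// 'A', U+00E9 (e with acute), U+65E5 (日), U+1F600 (an emoji).
	for _, r := range []rune{'A', '\u00e9', '日', '\U0001F600'} {
		fmt.Printf("%#U needs %d byte(s) in UTF-8\n", r, encodedLen(r))
	}

	s := "日本語"
	fmt.Println("bytes:", len(s))                    // len counts bytes, not characters
	fmt.Println("runes:", utf8.RuneCountInString(s)) // number of code points
	fmt.Println("utf8.UTFMax:", utf8.UTFMax)
}
```

The three CJK characters take 3 bytes each, which is why len(s) is 9 while the rune count is 3.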

One thing to note here: not every byte sequence is valid UTF-8. A string may hold any byte sequence, including sequences that are invalid UTF-8. For example, the string value "\xff" is an invalid UTF-8 byte sequence. For details, see https://stackoverflow.com/questions/30731687/how-do-i-represent-an-optional-string-in-go/30741287#30741287
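As a quick check of this, the sketch below asks the standard library whether a few strings are valid UTF-8. The helper name inspect is an assumption made for this example; utf8.ValidString is the real library function.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// inspect (hypothetical helper) returns the byte length of s and whether
// s consists entirely of valid UTF-8 sequences.
func inspect(s string) (n int, valid bool) {
	return len(s), utf8.ValidString(s)
}

func main() {
	for _, s := range []string{"日本語", "\xff", "日本\x80語"} {
		n, ok := inspect(s)
		// "% x" prints the raw bytes of the string in hex.
		fmt.Printf("bytes [% x]: length %d, valid UTF-8: %t\n", s, n, ok)
	}
}
```

The compiler accepts all three literals; validity is a property of the bytes, not something the string type enforces.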

The for range construct, when applied to a string value, iterates over the runes of the string. Quoting the language specification:

> For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.

The for range construct may produce 1 or 2 iteration values. When using 2, like in your example:

for pos, char := range "日本\x80語" {
	fmt.Printf("Character %#U, at position: %d\n", char, pos)
}

In each iteration, pos is the byte index at which the rune starts, and char is the rune itself. As the quote above says, when the iteration encounters an invalid UTF-8 sequence, char is 0xFFFD (the Unicode replacement character) and the iteration advances by a single byte only.

To sum up: the position is always the byte index of the first byte of the UTF-8 encoded sequence of the current rune. After a valid rune, the index advances by that rune's encoded length (1 to 4 bytes); after an invalid UTF-8 sequence, it advances by only 1. The positions therefore count encoded bytes in the string, not the 4 bytes that a decoded rune (int32) value occupies in memory.

A must-read blog post if you want to know more about the topic:

The Go Blog: Strings, bytes, runes and characters in Go

Answer 2

Score: 0

A rune is a code point, and a code point is just an integer. You could even store it in an int64 if you wanted to. (Unicode has only 1,114,112 code points, so int32 is a sensible choice, which is why rune is an alias for int32 in Go.)
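The 4-byte size applies to a decoded rune value held in a variable or a []rune, not to the bytes stored in the string. A small sketch of the difference follows; the helper name sizes is made up for this illustration.

```go
package main

import (
	"fmt"
	"unsafe"
)

// sizes (hypothetical helper) returns the number of bytes s occupies as a
// UTF-8 string and the number of bytes its runes occupy once decoded
// into a []rune (4 bytes per element).
func sizes(s string) (encoded, decoded int) {
	runes := []rune(s) // decodes the UTF-8 bytes into int32 values
	return len(s), len(runes) * int(unsafe.Sizeof(rune(0)))
}

func main() {
	var r rune
	fmt.Println("size of one rune variable:", unsafe.Sizeof(r)) // 4

	enc, dec := sizes("日本語")
	fmt.Printf("as string: %d bytes, as []rune: %d bytes\n", enc, dec)
}
```

A range loop over a string reports positions in the encoded form, so the 3-byte steps in the question are expected even though each char variable is 4 bytes wide.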

Different encoding schemes encode code points in different ways. For example, a CJK character is usually encoded as 3 bytes in UTF-8 and as 2 bytes in UTF-16.
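As a rough comparison of the two encodings, the following sketch uses the standard unicode/utf8 and unicode/utf16 packages. The helper name encodedSizes is an assumption made for this example.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// encodedSizes (hypothetical helper) returns the number of bytes r needs
// in UTF-8 and in UTF-16 (each UTF-16 code unit is 2 bytes).
func encodedSizes(r rune) (utf8Bytes, utf16Bytes int) {
	units := utf16.Encode([]rune{r}) // one unit, or a surrogate pair
	return utf8.RuneLen(r), len(units) * 2
}

func main() {
	for _, r := range []rune{'A', '語', '\U0001F600'} {
		u8, u16 := encodedSizes(r)
		fmt.Printf("%#U: %d byte(s) in UTF-8, %d byte(s) in UTF-16\n", r, u8, u16)
	}
}
```

Code points above U+FFFF need a surrogate pair in UTF-16, so they take 4 bytes in both encodings.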

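String literals in Go are UTF-8 encoded, because Go source files are UTF-8; the exception is any byte written with an escape such as \x80.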

Posted by huangapple on January 21, 2017 at 20:07:51
Source: https://go.coder-hub.com/41779147.html