What determines the position of a character when looping through UTF-8 strings?
Question
I am reading the section on `for` statements in the Effective Go documentation and came across this example:
for pos, char := range "日本\x80語" {
	fmt.Printf("Character %#U, at position: %d\n", char, pos)
}
The output is:
Character U+65E5 '日', at position: 0
Character U+672C '本', at position: 3
Character U+FFFD '�', at position: 6
Character U+8A9E '語', at position: 7
What I don't understand is why the positions are 0, 3, 6, and 7. This tells me the first and second characters are each 3 bytes long and the "replacement rune" (U+FFFD) is 1 byte long, which I accept and understand. However, I thought `rune` was of type `int32` and therefore each one would take 4 bytes, not 3.
Why do the positions in a `range` loop differ from the amount of memory each value should be consuming?
Answer 1
Score: 6
`string` values in Go are stored as read-only byte sequences (conceptually a read-only `[]byte`), where the bytes are the UTF-8 encoded bytes of the string's runes. UTF-8 is a variable-length encoding: different Unicode code points may be encoded using a different number of bytes. For example, values in the range `0..127` are encoded as a single byte (whose value is the Unicode code point itself), while values greater than 127 use more than one byte. The `unicode/utf8` package contains UTF-8 related utility functions and constants; for example, `utf8.UTFMax` is the maximum number of bytes a valid Unicode code point may occupy in UTF-8 encoding (which is 4).
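As a rough illustration of the variable-length encoding described above, the following sketch prints how many bytes each code point of a sample string occupies. The helper name `encodedLens` and the sample string are assumptions made for this example, and the helper assumes the input is valid UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLens is a hypothetical helper: for each rune in s it returns the
// number of bytes that rune's UTF-8 encoding occupies.
// It assumes s is valid UTF-8.
func encodedLens(s string) []int {
	var lens []int
	for _, r := range s {
		lens = append(lens, utf8.RuneLen(r))
	}
	return lens
}

func main() {
	// 'a', U+00E9, U+65E5 and U+1F600 need 1, 2, 3 and 4 bytes respectively.
	s := "a\u00e9日\U0001F600"

	fmt.Println(len(s))                    // 10: bytes in the string
	fmt.Println(utf8.RuneCountInString(s)) // 4: code points in the string
	fmt.Println(encodedLens(s))            // [1 2 3 4]
	fmt.Println(utf8.UTFMax)               // 4: upper bound per code point
}
```

Note that `len` on a string counts bytes, not code points, which is why it reports 10 here rather than 4.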
One thing to note here: not all possible byte sequences are valid UTF-8 sequences. A `string` may hold any byte sequence, including ones that are not valid UTF-8. For example, the `string` value `"\xff"` is an invalid UTF-8 byte sequence. For details, see https://stackoverflow.com/questions/30731687/how-do-i-represent-an-optional-string-in-go/30741287#30741287
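A minimal sketch of checking this with the standard library's `utf8.ValidString`. The wrapper name `check` and the sample strings are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// check is a hypothetical helper that reports the byte length of s and
// whether s consists entirely of valid UTF-8 sequences.
func check(s string) (byteLen int, valid bool) {
	return len(s), utf8.ValidString(s)
}

func main() {
	for _, s := range []string{"日本語", "\xff", "日本\x80語"} {
		n, ok := check(s)
		fmt.Printf("%q: %d bytes, valid UTF-8: %t\n", s, n, ok)
	}
}
```

All three values are legal Go strings; only the first one is valid UTF-8.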
The `for range` construct, when applied to a `string` value, iterates over the runes of the `string`:
> For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type `rune`, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be `0xFFFD`, the Unicode replacement character, and the next iteration will advance a single byte in the string.
The `for range` construct may produce one or two iteration values. When using two, as in your example:
for pos, char := range "日本\x80語" {
	fmt.Printf("Character %#U, at position: %d\n", char, pos)
}
For each iteration, `pos` will be the byte index of the rune / character, and `char` will be the rune itself. As the quote above says, when the iteration encounters an invalid UTF-8 sequence, `char` will be `0xFFFD` (the Unicode replacement character), and the `for range` construct will advance by a single byte only.
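To make the stepping rule concrete, here is a sketch that walks the same string by hand with `utf8.DecodeRuneInString`, which returns each rune together with the number of bytes it consumed. The helper name `decodeAll` is an assumption for this example; it is meant to mirror the behavior the quoted text describes, not to show how the compiler implements `range`.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAll is a hypothetical helper that walks s by hand and returns the
// starting byte index and the value of every decoded rune.
func decodeAll(s string) (pos []int, runes []rune) {
	for i := 0; i < len(s); {
		// For an invalid sequence this returns (utf8.RuneError, 1),
		// where utf8.RuneError is U+FFFD.
		r, size := utf8.DecodeRuneInString(s[i:])
		pos = append(pos, i)
		runes = append(runes, r)
		i += size // advance by the encoded width, not by 4
	}
	return pos, runes
}

func main() {
	pos, runes := decodeAll("日本\x80語")
	for k := range pos {
		fmt.Printf("Character %#U, at position: %d\n", runes[k], pos[k])
	}
}
```

The index advances by 3, 3, 1 and 3 bytes, which gives the positions 0, 3, 6 and 7 from the question.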
To sum it up: the position is always the byte index of the rune of the current iteration (more specifically, the byte index of the first byte of that rune's UTF-8 encoded sequence), and if an invalid UTF-8 sequence is encountered, the position (index) is incremented by only 1 for the next iteration. The `int32` size of a `rune` describes only the decoded value held in the loop variable, not how many bytes the code point occupies inside the string.
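If indices that count runes rather than bytes are wanted, one option is to convert the string to a `[]rune` first. The sketch below contrasts the two views; the helper name `sizes` is an assumption for this example, and the 4-byte figure relies on `rune` being an alias for `int32`.

```go
package main

import "fmt"

// sizes is a hypothetical helper that returns the number of bytes in the
// UTF-8 encoded string and the number of runes it decodes to.
func sizes(s string) (bytes, runes int) {
	return len(s), len([]rune(s))
}

func main() {
	s := "日本\x80語"
	rs := []rune(s) // decodes the bytes; the invalid byte becomes U+FFFD

	b, r := sizes(s)
	fmt.Println(b)   // 10: bytes in the string (3+3+1+3)
	fmt.Println(r)   // 4: decoded runes
	fmt.Println(r * 4) // 16: bytes the rune slice's elements occupy

	// Ranging over a []rune yields element indices 0, 1, 2, 3.
	for i, c := range rs {
		fmt.Printf("Character %#U, at index: %d\n", c, i)
	}
}
```

The conversion allocates a new slice, so it trades memory for simple indexing.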
A must-read blog post if you want to know more about the topic: The Go Blog: Strings, bytes, runes and characters in Go
Answer 2
Score: 0
A `rune` is a code point, and a code point is just an integer. You could even use an `int64` to store it if you wanted to. (But Unicode only has 1,114,112 code points, so `int32` is a sensible choice. No wonder `rune` is an alias for `int32` in Go.)
Different encoding schemes encode code points in different ways. For example, a CJK character is usually encoded as 3 bytes in UTF-8 and as 2 bytes in UTF-16.
String literals in Go are UTF-8.
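A small sketch comparing the two encodings with the standard `unicode/utf8` and `unicode/utf16` packages. The helper name `encodedSizes` and the sample characters are assumptions for this example.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// encodedSizes is a hypothetical helper that returns how many bytes the
// code point r needs in UTF-8 and in UTF-16.
func encodedSizes(r rune) (utf8Bytes, utf16Bytes int) {
	units := utf16.Encode([]rune{r}) // one or two 16-bit code units
	return utf8.RuneLen(r), 2 * len(units)
}

func main() {
	for _, r := range []rune{'a', '日', 0x1F600} {
		u8, u16 := encodedSizes(r)
		fmt.Printf("%#U: %d bytes in UTF-8, %d bytes in UTF-16\n", r, u8, u16)
	}
}
```

The same integer value takes 1, 3 or 4 bytes in UTF-8 and 2 or 4 bytes in UTF-16 here, while a `rune` variable holding it is always an `int32`.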