

What determines the position of a character when looping through UTF-8 strings?



I am reading the section on for statements in the Effective Go documentation and came across this example:

for pos, char := range "日本\x80語" {
	fmt.Printf("Character %#U, at position: %d\n", char, pos)
}

The output is:

Character U+65E5 '日', at position: 0
Character U+672C '本', at position: 3
Character U+FFFD '�', at position: 6
Character U+8A9E '語', at position: 7

What I don't understand is why the positions are 0, 3, 6, and 7. This tells me the first and second characters are each 3 bytes long and the 'replacement rune' (U+FFFD) is 1 byte long, which I accept and understand. However, I thought rune was an int32 and that each character would therefore take 4 bytes, not three.

Why are the positions in a range different from the total amount of memory each value should be consuming?


Score: 6




string values in Go are stored as read-only byte sequences (conceptually a read-only []byte), where the bytes are the UTF-8 encoded bytes of the runes of the string. UTF-8 is a variable-length encoding: different Unicode code points may be encoded using a different number of bytes. For example, values in the range 0..127 are encoded as a single byte (whose value is the Unicode code point itself), but values greater than 127 use more than 1 byte. The unicode/utf8 package contains UTF-8 related utility functions and constants; for example, utf8.UTFMax is the maximum number of bytes a valid Unicode code point may occupy in UTF-8 encoding (which is 4).
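As a small illustration of the variable-length encoding, the sketch below asks unicode/utf8 how many bytes a few sample runes need. The helper name encodedLen and the sample characters are chosen here for illustration only.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen returns how many bytes r occupies when encoded as UTF-8.
// It is a thin wrapper around utf8.RuneLen; the name is hypothetical.
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// One sample rune for each possible encoded length (1 to 4 bytes).
	for _, r := range []rune{'A', 'é', '日', '😀'} {
		fmt.Printf("%#U is encoded in %d byte(s)\n", r, encodedLen(r))
	}
	fmt.Println("utf8.UTFMax =", utf8.UTFMax)
}
```

The four sample runes need 1, 2, 3, and 4 bytes respectively, which is why '日' and '本' each move the position forward by 3 in the question's output.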

One thing to note here: not all possible byte sequences are valid UTF-8 sequences. A string may hold any byte sequence, including ones that are invalid UTF-8. For example, the string value "\xff" is an invalid UTF-8 byte sequence. For details, see https://stackoverflow.com/questions/30731687/how-do-i-represent-an-optional-string-in-go/30741287#30741287
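Whether a given string is valid UTF-8 can be checked with utf8.ValidString. The following is a minimal sketch; the wrapper name isValidUTF8 is made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isValidUTF8 reports whether s consists entirely of valid UTF-8 sequences.
// The name is hypothetical; the work is done by utf8.ValidString.
func isValidUTF8(s string) bool {
	return utf8.ValidString(s)
}

func main() {
	for _, s := range []string{"日本語", "\xff", "日本\x80語"} {
		fmt.Printf("%q valid UTF-8: %t (%d bytes)\n", s, isValidUTF8(s), len(s))
	}
}
```

Only the first string is valid; the other two compile and run fine, they simply contain a byte that cannot be decoded.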

The for range construct, when applied to a string value, iterates over the runes of the string:

> For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.

The for range construct may produce 1 or 2 iteration values. When using 2, like in your example:

for pos, char := range "日本\x80語" {
	fmt.Printf("Character %#U, at position: %d\n", char, pos)
}

For each iteration, pos will be the byte index of the rune / character, and char will be the rune itself. As you can see in the quote above, if the string contains an invalid UTF-8 sequence, then when that sequence is encountered char will be 0xFFFD (the Unicode replacement character), and the for range construct (the iteration) will advance a single byte only.

To sum it up: the position is always the byte index of the rune of the current iteration (more specifically, the byte index of the first byte of the UTF-8 encoded sequence of that rune). A valid rune advances the position by its encoded length (1 to 4 bytes), while an invalid UTF-8 sequence advances it by only 1 in the next iteration. The 4-byte size of the rune type describes the decoded value held in char, not how many bytes the character occupies inside the string.
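The stepping behavior summarized above can be reproduced by hand with utf8.DecodeRuneInString. This is only a sketch of the idea, not how the compiler implements range; the function name runeOffsets is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeOffsets walks s the way a for range loop does and returns the byte
// index at which each decoded rune starts.
func runeOffsets(s string) []int {
	var offsets []int
	for i := 0; i < len(s); {
		// For an invalid byte, DecodeRuneInString returns
		// (utf8.RuneError, 1), so the loop advances by one byte.
		_, size := utf8.DecodeRuneInString(s[i:])
		offsets = append(offsets, i)
		i += size
	}
	return offsets
}

func main() {
	s := "日本\x80語"
	fmt.Println(runeOffsets(s)) // byte index of each rune
	fmt.Println(len(s))         // total bytes in the string
	fmt.Println(len([]rune(s))) // number of runes after decoding
}
```

For the string from the question this prints [0 3 6 7], then 10, then 4: ten bytes in the string, decoded into four runes.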

A must-read blog post if you want to know more about the topic:

The Go Blog: Strings, bytes, runes and characters in Go


Score: 0





A rune is a code point, and a code point is just an integer. You could even use an int64 to store it if you wanted to. (But Unicode has only 1,114,112 code points, so int32 is the right choice. No wonder rune is an alias for int32 in Go.)
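To make the difference between the in-memory size of a rune and its encoded size concrete, here is a minimal sketch that compares the byte length of a string with the memory its decoded runes take. The helper runeBytesInMemory is a name invented for this example.

```go
package main

import (
	"fmt"
	"unsafe"
)

// runeBytesInMemory returns the number of bytes the decoded runes of s occupy
// when stored as a []rune (element data only, slice header not counted).
func runeBytesInMemory(s string) int {
	rs := []rune(s)
	return len(rs) * int(unsafe.Sizeof(rune(0)))
}

func main() {
	s := "日本語"
	fmt.Println(len(s))               // bytes in the UTF-8 encoded string
	fmt.Println(runeBytesInMemory(s)) // bytes used by the decoded runes
}
```

The string holds 9 bytes, while the three decoded runes take 12 bytes as int32 values.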

Different encoding schemes encode code points in different ways. For example, a CJK character is usually encoded as 3 bytes in UTF-8 and as 2 bytes in UTF-16.
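The size difference between the two encodings can be checked with the standard unicode/utf8 and unicode/utf16 packages. This is a small sketch; encodedSizes is a hypothetical helper, and it assumes r is a valid code point.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// encodedSizes returns the number of bytes r needs in UTF-8 and in UTF-16.
func encodedSizes(r rune) (utf8Bytes, utf16Bytes int) {
	utf8Bytes = utf8.RuneLen(r)
	// utf16.Encode returns 16-bit code units; each unit is 2 bytes.
	utf16Bytes = 2 * len(utf16.Encode([]rune{r}))
	return utf8Bytes, utf16Bytes
}

func main() {
	for _, r := range []rune{'A', '語', '😀'} {
		u8, u16 := encodedSizes(r)
		fmt.Printf("%#U: %d byte(s) in UTF-8, %d byte(s) in UTF-16\n", r, u8, u16)
	}
}
```

Characters outside the Basic Multilingual Plane, such as the emoji here, need 4 bytes in both encodings (a surrogate pair in UTF-16).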

Go source code is UTF-8, so a string literal holds UTF-8 encoded text (unless it contains byte-level escapes such as \x80).
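A quick way to see the bytes behind a literal is to print the string in hexadecimal. A minimal sketch, with hexBytes as a made-up helper name:

```go
package main

import "fmt"

// hexBytes formats the raw bytes of s as space-separated hexadecimal.
func hexBytes(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	fmt.Println(hexBytes("日"))          // the three UTF-8 bytes of U+65E5
	fmt.Println(hexBytes("\x80"))       // a single raw byte
	fmt.Println(hexBytes("日本\x80語")) // the full string from the question
}
```

The first line prints e6 97 a5, the UTF-8 encoding of U+65E5, and the second prints 80.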

  • Posted on January 21, 2017, 20:07:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/41779147.html



