How does the Go compiler know which bytes in a byte slice should be grouped together into one rune?

Question

Example:

package main

import "fmt"

func main() {
    byteSlice := []byte{226, 140, 152, 97, 98, 99} // UTF-8 bytes of "⌘abc"
    fmt.Println(string(byteSlice))
}

prints out:

⌘abc

Under the hood, how did Go know that the first three bytes (226, 140, 152) should be grouped together as a single rune, ⌘, while the remaining bytes should be converted to three separate runes: a, b, and c?

Answer 1

Score: 1

By decoding the UTF-8 bytes into code points (effectively UTF-32; a Go rune is an int32 that holds one code point).

It is a simple matter of looking at the leading bits of each octet, masking out the marker bits, and combining the data bits with bit shifts and bitwise OR.

Code point ↔ UTF-8 conversion

First code point  Last code point  Byte 1    Byte 2    Byte 3    Byte 4
U+0000            U+007F           0xxxxxxx
U+0080            U+07FF           110xxxxx  10xxxxxx
U+0800            U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
U+10000           U+10FFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
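
The table can be turned into code fairly directly. The following is a minimal sketch, not Go's actual implementation: decodeRune is a hypothetical helper that picks the sequence length from the leading bits of the first byte, masks off the marker bits, and shifts in six data bits per continuation byte. It deliberately skips the extra validation a real decoder performs (overlong encodings, surrogates, out-of-range values).

```go
package main

import "fmt"

// decodeRune is a hypothetical helper (not the standard library's code).
// It decodes the first UTF-8 sequence in b and returns the code point and
// the number of bytes consumed. Malformed input yields U+FFFD.
func decodeRune(b []byte) (rune, int) {
	if len(b) == 0 {
		return 0xFFFD, 0
	}
	b0 := b[0]
	var size int
	var r rune
	switch {
	case b0&0x80 == 0x00: // 0xxxxxxx: one byte, plain ASCII
		return rune(b0), 1
	case b0&0xE0 == 0xC0: // 110xxxxx: two-byte sequence
		size, r = 2, rune(b0&0x1F)
	case b0&0xF0 == 0xE0: // 1110xxxx: three-byte sequence
		size, r = 3, rune(b0&0x0F)
	case b0&0xF8 == 0xF0: // 11110xxx: four-byte sequence
		size, r = 4, rune(b0&0x07)
	default: // stray continuation byte or invalid lead byte
		return 0xFFFD, 1
	}
	if len(b) < size {
		return 0xFFFD, 1 // truncated sequence
	}
	for _, c := range b[1:size] {
		if c&0xC0 != 0x80 { // continuation bytes must look like 10xxxxxx
			return 0xFFFD, 1
		}
		r = r<<6 | rune(c&0x3F) // append the six data bits
	}
	return r, size
}

func main() {
	b := []byte{226, 140, 152, 97, 98, 99}
	for len(b) > 0 {
		r, size := decodeRune(b)
		fmt.Printf("%U %q (%d byte(s))\n", r, r, size)
		b = b[size:]
	}
}
```

For the bytes in the question, 226 is 11100010 in binary, so it announces a three-byte sequence. Its low four bits combined with the low six bits of 140 and 152 give U+2318, which is ⌘. The bytes 97, 98, and 99 all start with a 0 bit, so each one is a rune on its own.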

There's rather more to it than that, though, due to the various Unicode normalization forms and the possible presence of combining marks (e.g., e/U+0065 followed by the combining mark ´/U+0301 is canonically equivalent to the single code point (rune) é/U+00E9).

Interestingly, if you look at the sources for unicode/utf8, DecodeRune() and DecodeRuneInString() do nothing with respect to combining marks. They decode one code point at a time, so a combining mark simply comes back as its own rune. The decoder does not assume or require Unicode Normalization Form C (Canonical Decomposition followed by Canonical Composition); if you need normalized text, that is a separate step (for example, with the golang.org/x/text/unicode/norm package).
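
To see what DecodeRune() does with the bytes from the question and with a combining mark, here is a short sketch. runesOf is a hypothetical helper written for this example; only utf8.DecodeRune comes from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesOf is a hypothetical helper that splits b into runes by calling
// utf8.DecodeRune repeatedly and advancing by the reported width.
func runesOf(b []byte) []rune {
	var out []rune
	for len(b) > 0 {
		// For non-empty input the width is at least 1, so the loop
		// always makes progress (invalid bytes decode to utf8.RuneError).
		r, size := utf8.DecodeRune(b)
		out = append(out, r)
		b = b[size:]
	}
	return out
}

func main() {
	// The byte slice from the question.
	fmt.Printf("%U\n", runesOf([]byte{226, 140, 152, 97, 98, 99}))
	// An 'e' followed by a combining acute accent.
	fmt.Printf("%U\n", runesOf([]byte("e\u0301")))
}
```

The first call yields four runes, starting with U+2318. The second yields two runes, U+0065 and U+0301, so the combining mark is returned as a separate rune rather than being composed into é.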

huangapple
  • Posted on August 12, 2022, 03:58:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/73326286.html