How does the Go compiler know which bytes in a byte slice should be grouped together into one rune?
Question
Example:

    package main

    import "fmt"

    func main() {
    	byteSlice := []byte{226, 140, 152, 97, 98, 99} // UTF-8 bytes of "⌘abc"
    	fmt.Println(string(byteSlice))
    }

prints out:
⌘abc
Under the hood, how did Go know that the first three bytes (226, 140, 152) should be grouped together as a single rune (an int32), ⌘, while the remaining bytes should be converted to three separate runes, a, b, and c?
Answer 1
Score: 1
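By decoding the UTF-8 encoding into code points. A Go rune is an int32 holding one code point, which is effectively UTF-32.
The decoder looks at the leading bits of the first octet to find out how long the sequence is, masks out the marker bits of every octet in it, and combines the remaining data bits with bit shifts and bitwise OR.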
Code point ↔ UTF-8 conversion
| First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|
| U+0000 | U+007F | 0xxxxxxx | | | |
| U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
| U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
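The table translates almost directly into code. Here is a minimal sketch of that mask-and-shift procedure. `decodeOne` is a hypothetical helper written for this answer, not the standard library's implementation, and it deliberately leaves out checks a real decoder needs (overlong encodings, surrogates, values above U+10FFFF).

```go
package main

import "fmt"

// decodeOne is a hypothetical helper, not part of the standard library.
// It decodes the first UTF-8 sequence in b and returns the code point and
// the number of bytes consumed. Anything it cannot parse is reported as
// U+FFFD with a width of 1 (or 0 for empty input).
func decodeOne(b []byte) (rune, int) {
	if len(b) == 0 {
		return 0xFFFD, 0
	}
	b0 := b[0]
	var n int  // total length of the sequence, taken from the lead byte
	var r rune // data bits collected so far
	switch {
	case b0&0x80 == 0x00: // 0xxxxxxx: ASCII, one byte
		return rune(b0), 1
	case b0&0xE0 == 0xC0: // 110xxxxx: two bytes
		n, r = 2, rune(b0&0x1F)
	case b0&0xF0 == 0xE0: // 1110xxxx: three bytes
		n, r = 3, rune(b0&0x0F)
	case b0&0xF8 == 0xF0: // 11110xxx: four bytes
		n, r = 4, rune(b0&0x07)
	default: // stray continuation byte or invalid lead byte
		return 0xFFFD, 1
	}
	if len(b) < n {
		return 0xFFFD, 1
	}
	for _, c := range b[1:n] {
		if c&0xC0 != 0x80 { // continuation bytes must look like 10xxxxxx
			return 0xFFFD, 1
		}
		r = r<<6 | rune(c&0x3F) // append the six data bits
	}
	return r, n
}

func main() {
	b := []byte{226, 140, 152, 97, 98, 99}
	for len(b) > 0 {
		r, n := decodeOne(b)
		fmt.Printf("%U %q (%d bytes)\n", r, r, n)
		b = b[n:]
	}
}
```

For the bytes in the question, 226 140 152 is 0xE2 0x8C 0x98. The lead byte 1110 0010 announces a three-byte sequence and contributes the data bits 0010. The two continuation bytes contribute 001100 and 011000. Joined together, 0010 001100 011000 is 0x2318, which is U+2318, ⌘. The bytes 97, 98 and 99 all have a leading 0 bit, so each one is a rune on its own.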
There is more to text than code points, though. Because of combining marks and the Unicode normalization forms, the same user-perceived character can be spelled in more than one way: e (U+0065) followed by the combining acute accent (U+0301) is canonically equivalent to the single precomposed code point é (U+00E9).

UTF-8 decoding does not deal with any of that. If you look at the sources for `unicode/utf8`, `DecodeRune()` and `DecodeRuneInString()`

- https://cs.opensource.google/go/go/+/refs/tags/go1.19:src/unicode/utf8/utf8.go;l=151
- https://cs.opensource.google/go/go/+/refs/tags/go1.19:src/unicode/utf8/utf8.go;l=199

do nothing with respect to combining marks. They return one code point per call, so a combining mark comes back as its own rune regardless of which normalization form the input is in. Composing or decomposing such sequences is a separate step, handled in Go by the `golang.org/x/text/unicode/norm` package.
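A small experiment with the standard decoder shows how it treats the byte slice from the question and a sequence containing a combining mark. Note that `string(byteSlice)` itself only copies the bytes. The grouping into runes happens when something decodes the string, such as `utf8.DecodeRune`, a `range` loop, or a conversion to `[]rune`. The helper name `runesOf` below is made up for this sketch.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesOf (a hypothetical helper) decodes every rune in b
// by calling utf8.DecodeRune repeatedly.
func runesOf(b []byte) []rune {
	var out []rune
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b) // size is at least 1 for non-empty input
		out = append(out, r)
		b = b[size:]
	}
	return out
}

func main() {
	// The byte slice from the question: one 3-byte rune, then three ASCII runes.
	fmt.Printf("%U\n", runesOf([]byte{226, 140, 152, 97, 98, 99}))

	// "e" followed by U+0301 (combining acute accent): the decoder
	// returns two separate runes and does not compose them.
	fmt.Printf("%U\n", runesOf([]byte("e\u0301")))

	// The precomposed form U+00E9 decodes to a single rune.
	fmt.Printf("%U\n", runesOf([]byte("\u00e9")))
}
```

The decomposed and precomposed spellings of é produce different rune sequences here, so comparing them as equal requires normalizing both first.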