August 12, 2022 03:58:12 · go

How does the Go compiler know which bytes in a byte slice should be grouped together into one rune?

Question

Example:

package main

import "fmt"

func main() {
    byteSlice := []byte{226, 140, 152, 97, 98, 99}
    fmt.Println(string(byteSlice))
}

prints out:

⌘abc

Under the hood, how did Go know that the first three bytes (226, 140, 152) should be grouped together as a single rune (an int32 code point), ⌘, while the remaining bytes should be converted to three separate runes: a, b, and c, respectively?

Answer 1

Score: 1

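By decoding the UTF-8 encoding into code points (runes, which are effectively UTF-32 values). Note that string(byteSlice) itself only copies the bytes; the decoding happens when you range over the string, convert it to []rune, or call the unicode/utf8 functions, and your terminal does the same thing when it displays the output.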

A simple matter of looking at the leading bits of each octet, masking out the sentinel bits, and combining the data bits with bit shifts and bitwise OR.

Code point ↔ UTF-8 conversion

First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4
U+0000	U+007F	0xxxxxxx
U+0080	U+07FF	110xxxxx	10xxxxxx
U+0800	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
U+10000	U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx
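
The table above can be turned into a small hand-written decoder. This is only a sketch to illustrate the masking and shifting: decodeOne is a hypothetical helper, not the code in unicode/utf8, and unlike the real package it does not reject overlong encodings, surrogates, or values above U+10FFFF.

```go
package main

import "fmt"

// decodeOne decodes the first UTF-8 sequence in b and returns the code point
// and the number of bytes consumed. Malformed input yields U+FFFD.
// Simplified on purpose: no checks for overlong forms or surrogates.
func decodeOne(b []byte) (rune, int) {
	if len(b) == 0 {
		return 0xFFFD, 0
	}
	var size int
	var r rune
	switch b0 := b[0]; {
	case b0&0x80 == 0x00: // 0xxxxxxx: one byte, 7 data bits
		return rune(b0), 1
	case b0&0xE0 == 0xC0: // 110xxxxx: two bytes
		size, r = 2, rune(b0&0x1F)
	case b0&0xF0 == 0xE0: // 1110xxxx: three bytes
		size, r = 3, rune(b0&0x0F)
	case b0&0xF8 == 0xF0: // 11110xxx: four bytes
		size, r = 4, rune(b0&0x07)
	default: // stray continuation byte or invalid lead byte
		return 0xFFFD, 1
	}
	if len(b) < size {
		return 0xFFFD, 1
	}
	for _, c := range b[1:size] {
		if c&0xC0 != 0x80 { // continuation bytes must look like 10xxxxxx
			return 0xFFFD, 1
		}
		r = r<<6 | rune(c&0x3F) // append the 6 data bits
	}
	return r, size
}

func main() {
	b := []byte{226, 140, 152, 97, 98, 99}
	for len(b) > 0 {
		r, n := decodeOne(b)
		fmt.Printf("%U %q (%d bytes)\n", r, r, n)
		b = b[n:]
	}
}
```

For the bytes in the question, the first iteration sees 226 (11100010), recognizes the 1110xxxx lead byte, and combines it with the two continuation bytes into U+2318 (⌘); the remaining three bytes each start with a 0 bit and decode as single-byte runes.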

There's rather more to text handling than that, though, due to the various Unicode normalization forms and the possible presence of combining marks (e.g., e/U+0065 followed by the combining mark ´/U+0301 renders as é, which is canonically equivalent to the single precomposed code point U+00E9).

Interestingly, if you look at the sources for unicode/utf8, DecodeRune() and DecodeRuneInString() do nothing with respect to combining marks: each code point, including a combining mark, comes back as its own rune. Decoding therefore makes no assumption about the normalization form of the input. Normalization, such as Unicode Normalization Form C (Canonical Decomposition followed by Canonical Composition), is a separate step, handled by packages such as golang.org/x/text/unicode/norm.
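
One way to check how unicode/utf8 treats a combining mark is to decode a decomposed é and a precomposed é and compare. A minimal sketch; runesOf is a hypothetical helper written for this example, while utf8.DecodeRuneInString is the real standard-library function.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesOf decodes s one rune at a time with utf8.DecodeRuneInString.
func runesOf(s string) []rune {
	var out []rune
	for len(s) > 0 {
		r, size := utf8.DecodeRuneInString(s)
		out = append(out, r)
		s = s[size:]
	}
	return out
}

func main() {
	decomposed := "e\u0301" // 'e' followed by U+0301 COMBINING ACUTE ACCENT
	precomposed := "\u00e9" // the single code point U+00E9

	fmt.Printf("%U\n", runesOf(decomposed))  // two runes: the mark is kept as-is
	fmt.Printf("%U\n", runesOf(precomposed)) // one rune
}
```

The decomposed string yields two runes (U+0065 and U+0301) and the precomposed one yields a single rune (U+00E9), even though both render as é.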
