Can a character span multiple runes in Go?
Question
I read this on a blog:
> Even with rune slices a single character might span multiple runes, which can happen if you have characters with grave accent, for example. This complicated and ambiguous nature of "characters" is the reason why Go strings are represented as byte sequences.
Is it true? (It seems like a blog from someone who knows Go.) I tested on my machine and "è" is 1 rune and 2 bytes, and the Go documentation seems to say otherwise.
Have you encountered such characters (in UTF-8)? Can a character span multiple runes in Go?
Answer 1
Score: 8
Yes, it can:
s := "e\u0301\u0301\u0301" // 'e' followed by three combining acute accents (U+0301)
fmt.Println(s, []rune(s))
Output (try it on the Go Playground):
é́́ [101 769 769 769]
One character, 4 runes. It may be arbitrarily long...
Example taken from The Go Blog: Text Normalization in Go.
> What is a character?
>
> As was mentioned in the strings blog post, characters can span multiple runes. For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character. The definition of a character may vary depending on the application. For normalization we will define it as a sequence of runes that starts with a starter, a rune that does not modify or combine backwards with any other rune, followed by possibly empty sequence of non-starters, that is, runes that do (typically accents). The normalization algorithm processes one character at a time.
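The "starter followed by non-starters" definition quoted above can be approximated with the standard library alone. The sketch below is only an illustration: the helper name splitCharacters is made up here, and it assumes that any rune in the Unicode Mark category counts as a non-starter. That is a simplification of the normalization rules; for real text the golang.org/x/text/unicode/norm package described in the blog post is probably the better tool.

```go
package main

import (
	"fmt"
	"unicode"
)

// splitCharacters groups the runes of s into "characters": one leading rune
// followed by every combining mark that attaches to it.
// Assumption: a rune is treated as a non-starter when unicode.IsMark reports
// true, which only approximates the definition used for normalization.
func splitCharacters(s string) []string {
	var chars []string
	var cur []rune
	for _, r := range s {
		if unicode.IsMark(r) && len(cur) > 0 {
			// A combining mark extends the character being built.
			cur = append(cur, r)
			continue
		}
		// Any other rune starts a new character; flush the previous one.
		if len(cur) > 0 {
			chars = append(chars, string(cur))
		}
		cur = []rune{r}
	}
	if len(cur) > 0 {
		chars = append(chars, string(cur))
	}
	return chars
}

func main() {
	s := "ne\u0301\u0301\u0301e" // 'n', 'e' with three acutes, plain 'e'
	for _, c := range splitCharacters(s) {
		fmt.Printf("%q -> %U\n", c, []rune(c))
	}
}
```

With this input the helper returns three groups, and the middle one holds four runes: the 'e' plus its three accents.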
A character can be followed by any number of modifiers (modifiers can be repeated and stacked):
> Theoretically, there is no bound to the number of runes that can make up a Unicode character. In fact, there are no restrictions on the number of modifiers that can follow a character and a modifier may be repeated, or stacked. Ever seen an 'e' with three acutes? Here you go: 'é́́'. That is a perfectly valid 4-rune character according to the standard.
Also see: Combining character.
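This also explains the measurement in the question. An "è" typed on most keyboards is the precomposed code point U+00E8, which is 1 rune and 2 bytes in UTF-8. The same visible character can also be written as 'e' followed by the combining grave accent U+0300, which is 2 runes and 3 bytes. A small sketch comparing the two forms (the helper name describe is just for illustration):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports the UTF-8 byte length and the rune count of s.
func describe(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	precomposed := "\u00e8" // è as a single code point
	decomposed := "e\u0300" // 'e' followed by a combining grave accent

	for _, s := range []string{precomposed, decomposed} {
		b, r := describe(s)
		fmt.Printf("%q: %d bytes, %d runes, code points %U\n", s, b, r, []rune(s))
	}
}
```

Both strings usually render identically, so the only reliable way to tell them apart is to inspect the code points, as the %U verb does here.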
Edit: "Doesn't this kill the 'concept of runes'?"
Answer: No, because a rune was never meant to be a character. A rune is an integer value identifying a Unicode code point. A character may be a single Unicode code point, in which case 1 character is 1 rune. Most everyday uses of runes fit this case, so in practice it rarely causes headaches. Characters made of several code points are a concept of the Unicode standard.
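To see that a rune is just an integer, here is a minimal sketch (the helper name codePoints is hypothetical). In Go, rune is an alias for int32, so the values produced by ranging over a string can be stored in an []int32 without any conversion.

```go
package main

import "fmt"

// codePoints returns the Unicode code points of s as plain integers.
func codePoints(s string) []int32 {
	var out []int32
	for _, r := range s { // ranging over a string yields runes
		out = append(out, r) // no conversion needed: rune is an alias for int32
	}
	return out
}

func main() {
	var r rune = '\u00e8' // precomposed è
	fmt.Printf("%T %d %U\n", r, r, r) // int32 232 U+00E8

	fmt.Println(codePoints("e\u0301")) // [101 769]
}
```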