What are Unicode codepoint types for?


I recently read the UTF-8 Everywhere manifesto, a document arguing for handling text with UTF-8 by default. The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.

However, some modern languages that default to UTF-8 have built-in codepoint types, such as rune in Go and char in Rust.

What are these types actually useful for? Are they legacy from times before the meaninglessness of codepoints was broadly understood? Or is that an incomplete perspective?


Score: 1



Text has many different meanings and uses, so the question is difficult to answer.

First, about the term codepoint. We use it because it is simple, it implies a number (a code), and it is not easily confused with other terms. Unicode itself says that it does not use the terms codepoint and character consistently, but also that this is not a problem: the context is usually clear, and the two are often interchangeable (except for the few codepoints that are not characters, such as surrogates and some reserved codepoints). Note that Unicode is mostly about characters, while ISO 10646 was mostly about codepoints: the original ISO standard was a table of numbers (codepoints) and names, and Unicode was about the properties of characters. So we may say codepoint where "Unicode character" would be more accurate, but "character" is easily confused with the C char type and with font glyphs and graphemes.

Codepoints are a basic unit and therefore useful for most programs: storing text in databases, exchanging it with other programs, saving files, sorting, and so on. For exactly these reasons, programming languages provide a codepoint type. UTF-8 code units could be an alternative, but they are harder to navigate: think of UTF-8 as a tape that you have to read sequentially, and of codepoint text as a hard disk where you can jump straight into the middle of a text. The analogy is not 100% accurate, because you may still need some context bytes. If you are receiving user text and will just store it in a database, your program probably does not need to split it into graphemes, apply ligatures, and so on. Codepoints are really low level and therefore fast for most operations.
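To make the tape-versus-disk comparison concrete, here is a minimal Python sketch (a Python str is indexed by codepoint). The helper name codepoint_start is hypothetical and introduced only for illustration; it shows the "context bytes" caveat, where after jumping into the middle of UTF-8 data you may have to step back over continuation bytes to find where a codepoint begins.

```python
text = "na\u00efve \u20ac5"   # "naïve €5", written with escapes to be unambiguous

# A Python str is a sequence of codepoints, so indexing is direct.
third = text[2]               # the codepoint U+00EF

data = text.encode("utf-8")
# Code units (bytes) and codepoints stop lining up once non-ASCII appears.
print(len(text), len(data))


def codepoint_start(buf: bytes, i: int) -> int:
    """Hypothetical helper: move i back to the first byte of the codepoint containing it.

    UTF-8 continuation bytes always have the bit pattern 10xxxxxx.
    """
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i


# Byte 9 is the last byte of the three-byte euro sign, which starts at byte 7.
start = codepoint_start(data, 9)
print(start, data[start:start + 3].decode("utf-8"))
```

Stepping back needs at most three bytes of context, so this is cheap, but it is still one step more than indexing an array of codepoints.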

The other part of text handling is displaying it (or speaking it). This part is very complex, because there are many scripts with very different rules, and then different languages with their own special cases. So we need a series of libraries: text layout (word separation and so on, e.g. Pango), a shaping engine (to find which glyph to use, how to combine characters, and where to put the next character, e.g. HarfBuzz), and a font library that renders the font (Cairo plus FreeType). It is complex, but most programmers do not need special handling: they just read text from a database and send it to the screen, so they use the relevant library (which depends on the operating system) and move on. It is too complex for a language specification, and also a moving target; maybe in 30 years things will be more standardized. Because display is complex and involves many operations anyway, we can afford complex structures there (an array of arrays of codepoints, that is, an array of graphemes) without much of a slowdown. Note that fonts contain codepoint tables used for various operations before the glyph index is found, and various APIs take Unicode strings (as codepoint arrays, UTF-16, UTF-8, etc.).
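As a rough illustration of the "array of arrays of codepoints" idea, the following Python sketch groups each base codepoint with the combining marks that follow it. The function naive_clusters is a hypothetical name, and the rule it uses is only an approximation: real grapheme segmentation follows the Unicode text segmentation rules (UAX #29), which cover many more cases.

```python
import unicodedata


def naive_clusters(text: str) -> list:
    """Group each base codepoint with any combining marks that follow it.

    Assumption: a nonzero canonical combining class means "attach to the
    previous cluster". This is NOT full grapheme segmentation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # extend the current cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


s = "g\u0301a\u0301b"               # g + acute, a + acute, b
clusters = naive_clusters(s)
print(len(s), len(clusters))
```

Five codepoints become three clusters here, which is closer to what a user would count as characters.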

Naturally, things get more complex, and require a lot of knowledge of different parts of Unicode, if you are writing an editor (WYSIWYG, but also terminal-based): you mix both worlds, and you need much more information (e.g. for text selection). But in that case you have to create your own structures.

And really, things are complex: do you just want to show the first x characters on your blog (maybe about assessment), or split at words? Some languages are not so linear, so the interpretation may be very wrong. For now only humans can do a good job for all languages, so there is not yet a need for a supporting type in programming languages.


Score: 0





> The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.

Where? It merely outlines advantages and disadvantages of code points. Two examples are:

> Some abstract characters can be encoded by different code points; U+03A9 greek capital letter omega and U+2126 ohm sign both correspond to the same abstract character Ω, and must be treated identically.

> Moreover, for some abstract characters, there exist representations using multiple code points, in addition to the single coded character form. The abstract character ǵ can be coded by the single code point U+01F5 latin small letter g with acute, or by the sequence <U+0067 latin small letter g, U+0301 combining acute accent>.
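Both quoted cases can be checked with Python's standard unicodedata module. This is a small sketch; the helper same_text is a hypothetical name, and comparing NFC forms is one common way to treat canonically equivalent sequences as identical, not the only one.

```python
import unicodedata

ohm, omega = "\u2126", "\u03a9"     # OHM SIGN, GREEK CAPITAL LETTER OMEGA
g_single = "\u01f5"                 # LATIN SMALL LETTER G WITH ACUTE
g_sequence = "g\u0301"              # g followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Raw codepoint comparison sees different strings ...
print(ohm == omega, g_single == g_sequence)
# ... while normalized comparison treats each pair as the same text.
print(same_text(ohm, omega), same_text(g_single, g_sequence))
```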

In other words: code points just index which graphemes Unicode supports.

  • Sometimes they're meant as single characters: one prominent example would be € (EURO SIGN), having only the code point U+20AC.

  • Sometimes the same character has multiple code points depending on context: the dollar sign exists as:

    • ﹩ = U+FE69 (SMALL DOLLAR SIGN)
    • ＄ = U+FF04 (FULLWIDTH DOLLAR SIGN)
    • 💲 = U+1F4B2 (HEAVY DOLLAR SIGN)

    Storage-wise, when searching for one variant you might want to match all 3 variants instead of relying on the exact code point only.

  • Sometimes multiple code points can be combined to form a single character:

    • á = U+00E1 (LATIN SMALL LETTER A WITH ACUTE), also termed "precomposed"
    • á = combination of U+0061 (LATIN SMALL LETTER A) and U+0301 (COMBINING ACUTE ACCENT). In a text editor, trying to delete it (from the right side) will mostly result in deleting only the acute accent first. Searching for either variant should find both variants.

    Storage-wise, you avoid having to search for both variants by performing Unicode normalization, i.e. NFC, which always favors the precomposed code point over two code points combined to form one character.

  • As for homoglyphs, code points clearly distinguish the contextual meaning:

    • A = U+0041 (LATIN CAPITAL LETTER A)
    • Α = U+0391 (GREEK CAPITAL LETTER ALPHA)
    • А = U+0410 (CYRILLIC CAPITAL LETTER A)

    Copy the Greek or Cyrillic character, then search this website for that letter: it will never find the other letters, no matter how similar they look. Likewise, the Latin letter A won't find the Greek or Cyrillic one.

  • Writing-system-wise, code points can be used by multiple alphabets: the CJK portion is an attempt to use as few code points as possible while supporting as many languages as possible (Chinese (simplified, traditional, Hong Kong), Japanese, Korean, Vietnamese):

    • 今 = U+4ECA
    • 入 = U+5165
    • 才 = U+624D
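The list items above can be explored with Python's standard unicodedata module. This is only a sketch under stated assumptions: using NFKC to fold variants is one possible approach, not something the manifesto prescribes, and it does not cover every case (HEAVY DOLLAR SIGN has no decomposition, so it stays distinct).

```python
import unicodedata

# Dollar-sign variants: NFKC folds the small and fullwidth forms to U+0024,
# while HEAVY DOLLAR SIGN is left unchanged.
variants = ["\ufe69", "\uff04", "\U0001f4b2"]
folded = [unicodedata.normalize("NFKC", v) for v in variants]
print(folded)

# Homoglyphs: the three look-alike capitals stay distinct codepoints,
# and their names tell them apart.
lookalikes = ["\u0041", "\u0391", "\u0410"]
names = [unicodedata.name(c) for c in lookalikes]
still_distinct = {unicodedata.normalize("NFKC", c) for c in lookalikes}
print(names, len(still_distinct))

# Deleting the last codepoint of a decomposed sequence removes only the accent.
decomposed = "a\u0301"
after_backspace = decomposed[:-1]
print(after_backspace)
```

A search that should match all dollar variants would therefore need an explicit mapping for the cases that normalization leaves alone.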

There are valid reasons for a programmer to deal with code points. Programming languages that support them may (or may not) support correct encodings (UTF-8 vs. UTF-16 vs. ISO-8859-1) and may (or may not) correctly produce surrogates for UTF-16. Text-wise, users should not have to be concerned with code points, although it would help them distinguish homoglyphs.
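For the surrogate point, here is a short Python sketch that computes a UTF-16 surrogate pair by hand and compares it with what the built-in encoder produces. Both function names are hypothetical and exist only for this illustration.

```python
import struct


def utf16_code_units(ch: str) -> list:
    """Hypothetical helper: UTF-16 code units of a string, via the built-in encoder."""
    raw = ch.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))


def surrogate_pair(cp: int) -> tuple:
    """Compute the (high, low) surrogates for a code point above U+FFFF."""
    v = cp - 0x10000                      # 20-bit value
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


euro = utf16_code_units("\u20ac")         # inside the BMP: one code unit
heavy = utf16_code_units("\U0001f4b2")    # outside the BMP: two code units
by_hand = surrogate_pair(0x1F4B2)
print([hex(u) for u in euro], [hex(u) for u in heavy], [hex(u) for u in by_hand])
```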

  • Posted on 2022-09-09 22:03:16
  • When reposting, please keep the link to this article: https://go.coder-hub.com/73663349.html



