What are Unicode codepoint types for?

Question

I recently read the UTF-8 Everywhere manifesto, a document arguing for handling text with UTF-8 by default. The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.

However, some modern languages that use the UTF-8 default have built-in codepoint types, such as rune in Go and char in Rust.

What are these types actually useful for? Are they legacy from times before the meaninglessness of codepoints was broadly understood? Or is that an incomplete perspective?

Answer 1

Score: 1

Text has many different meanings and uses, so the question is difficult to answer.

First, about the term codepoint. We use the term codepoint because it is simple, it implies a number (a code), and it is not easily confused with other terms. Unicode tells us that it does not use the terms codepoint and character consistently, but also that this is not a problem: the context is clear, and the two are often interchangeable (except for a few codepoints that are not characters, such as surrogates and a few reserved codepoints). Note: Unicode is mostly about characters, and ISO 10646 was mostly about codepoints. The original ISO standard was about a table of numbers (codepoints) and names, and Unicode was about the properties of characters. So we may say codepoint where Unicode character would be more accurate, but character is easily confused with C's char, and with font glyphs and graphemes.

Codepoints are a basic unit, and so are useful for most programs, e.g. to store text in databases, to exchange it with other programs, to save files, for sorting, etc. For exactly this reason programming languages use the codepoint as a type. UTF-8 code units could be an alternative, but they are more difficult to navigate (think of UTF-8 as a tape that you have to read sequentially, and of codepoint text as a hard disk where you can jump straight into the middle of a text). The analogy is not 100% accurate, because you may need some context bytes. If you are receiving user text and will just store it in a database, your program probably does not need to split it into graphemes, form ligatures, etc. Codepoints are really low level, and so fast for most operations.
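To make the difference between codepoints and UTF-8 code units concrete, here is a minimal Python sketch. The helper names `is_boundary` and `align_back` are made up for this illustration; they only rely on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx, which is what lets you recover a codepoint boundary from a few bytes of context.

```python
text = "a\u03a9\u20ac\U0001f4b2"  # 'a', 'Ω', '€', '💲': 1, 2, 3 and 4 bytes in UTF-8

code_points = [ord(ch) for ch in text]  # one integer per codepoint
utf8 = text.encode("utf-8")             # the same text as UTF-8 code units (bytes)


def is_boundary(data: bytes, i: int) -> bool:
    """True if byte offset i starts a codepoint (it is not a 10xxxxxx continuation byte)."""
    return i == len(data) or (data[i] & 0xC0) != 0x80


def align_back(data: bytes, i: int) -> int:
    """Move an arbitrary byte offset back to the start of the codepoint containing it."""
    while not is_boundary(data, i):
        i -= 1
    return i


print(len(text), len(utf8))          # 4 codepoints, 10 bytes
start = align_back(utf8, 4)          # offset 4 is inside '€', which starts at offset 3
print(utf8[start:start + 3].decode("utf-8"))  # €
```

Indexing `text` by position gives a whole codepoint directly, while an arbitrary byte offset into `utf8` may have to be realigned first.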

The other part of text is displaying it (or speaking it). This part is very complex, because we have many different scripts with very different rules, and then different languages with their own special cases. So we need a series of libraries, e.g. text layout (word separation, etc., like Pango), a shaping engine (to find which glyph to use, combine characters, and decide where to put the next character, e.g. HarfBuzz), and a font library that renders the font (Cairo plus FreeType). It is complex, but most programmers do not need special handling: they just read text from a database and send it to the screen, so we simply use the relevant library (which depends on the operating system) and move on. It is too complex for a language specification (and also a moving target; maybe in 30 years things will be more standardized). Because this part is complex and involves many operations anyway, we may use complex structures (an array of arrays of codepoints, i.e. an array of graphemes) without much of a slowdown. Note: fonts have codepoint tables that are used to perform various operations before the glyph index is found. Various APIs take Unicode strings (as codepoint arrays, UTF-16, UTF-8, etc.).

Naturally, things get more complex, and require a lot of knowledge of different parts of Unicode, if you are trying to write an editor (WYSIWYG, but also terminal-based): you mix both worlds, and you need much more information (e.g. for text selection). But in this case you must create your own structures.

And really, things are complex: do you just want to show the first x characters on your blog (maybe about assessment), or split at words (some languages are not so linear, so the interpretation may be very wrong)? For now only humans can do a good job for all languages, so there is not yet a need for a supporting type in the different programming languages.
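As an illustration of why "the first x characters" is harder than it sounds, here is a small Python sketch. The function `truncate_code_points` is a hypothetical helper, not a standard API, and it only handles combining marks via `unicodedata.combining`; it is not full grapheme segmentation (it does not cover, for example, emoji joined with ZWJ).

```python
import unicodedata


def truncate_code_points(text: str, limit: int) -> str:
    """Keep at most `limit` codepoints, but extend the cut past any combining
    marks that follow, so that a base letter keeps its accents."""
    end = min(limit, len(text))
    while end < len(text) and unicodedata.combining(text[end]) != 0:
        end += 1
    return text[:end]


word = "cafe\u0301 au lait"  # 'é' written as 'e' + U+0301 COMBINING ACUTE ACCENT

naive = word[:4]                        # cuts between 'e' and its accent
safer = truncate_code_points(word, 4)   # keeps the accent with its base letter

print(len(naive), len(safer))  # 4 5
```

Even the safer version still counts codepoints, so what a reader perceives as "x characters" can differ from script to script.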

Answer 2

Score: 0

> The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.

Where? It merely outlines advantages and disadvantages of code points. Two examples are:

> Some abstract characters can be encoded by different code points; U+03A9 greek capital letter omega and U+2126 ohm sign both correspond to the same abstract character Ω, and must be treated identically.

> Moreover, for some abstract characters, there exist representations using multiple code points, in addition to the single coded character form. The abstract character ǵ can be coded by the single code point U+01F5 latin small letter g with acute, or by the sequence <U+0067 latin small letter g, U+0301 combining acute accent>.

In other words: code points just index the characters that Unicode encodes.

  • Sometimes they're meant as single characters: one prominent example would be € (EURO SIGN), having only the code point U+20AC.

  • Sometimes the same character has multiple code points depending on context: the dollar sign exists as:

    • ﹩ = U+FE69 (SMALL DOLLAR SIGN)
    • ＄ = U+FF04 (FULLWIDTH DOLLAR SIGN)
    • 💲 = U+1F4B2 (HEAVY DOLLAR SIGN)

    Storage-wise, when searching for one variant you might want to match all 3 variants instead of relying on the exact code point only.

  • Sometimes multiple code points can be combined to form a single character:

    • á = U+00E1 (LATIN SMALL LETTER A WITH ACUTE), also termed "precomposed"
    • á = combination of U+0061 (LATIN SMALL LETTER A) and U+0301 (COMBINING ACUTE ACCENT). In a text editor, trying to delete it (from the right side) will mostly result in deleting the acute accent first. Searching for either variant should find both variants.

    Storage-wise, you can avoid having to search for both variants by performing Unicode normalization, e.g. NFC, which favors precomposed code points over two combined code points forming one character.

  • As for homoglyphs, code points clearly distinguish the contextual meaning:

    • A = U+0041 (LATIN CAPITAL LETTER A)
    • Α = U+0391 (GREEK CAPITAL LETTER ALPHA)
    • А = U+0410 (CYRILLIC CAPITAL LETTER A)

    Copy the Greek or Cyrillic character, then search this website for that letter: it will never find the other letters, no matter how similar they look. Likewise, the Latin letter A won't find the Greek or Cyrillic one.

  • Writing-system-wise, code points can be shared by multiple writing systems: the CJK portion is an attempt to use as few code points as possible while supporting as many languages as possible (Chinese (simplified, traditional, Hong Kong), Japanese, Korean, Vietnamese):

    • 今 = U+4ECA
    • 入 = U+5165
    • 才 = U+624D
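The equivalence cases in the list above can be checked with Python's standard `unicodedata` module. This is only a sketch: `nfc` and `search_key` are helper names made up for this example, and using NFKC as a search key is one possible choice, not something the manifesto or the answer prescribes.

```python
import unicodedata


def nfc(text: str) -> str:
    """Canonical composition: prefers precomposed code points where they exist."""
    return unicodedata.normalize("NFC", text)


def search_key(text: str) -> str:
    """Hypothetical search key: NFKC also folds compatibility variants together."""
    return unicodedata.normalize("NFKC", text)


precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT
same_after_nfc = nfc(decomposed) == precomposed   # True

ohm_as_omega = nfc("\u2126") == "\u03a9"          # OHM SIGN becomes GREEK CAPITAL LETTER OMEGA

# SMALL and FULLWIDTH DOLLAR SIGN fold to "$"; HEAVY DOLLAR SIGN has no
# decomposition, so matching it would need an explicit mapping of your own.
dollar_keys = [search_key(ch) for ch in "\ufe69\uff04\U0001f4b2"]

# Homoglyphs stay distinct: normalization never merges Latin, Greek and Cyrillic A.
homoglyph_names = [unicodedata.name(ch) for ch in "\u0041\u0391\u0410"]
distinct_keys = len({search_key(ch) for ch in "\u0041\u0391\u0410"})  # 3

print(same_after_nfc, ohm_as_omega, dollar_keys, distinct_keys)
```

Normalizing once when storing, and again on each search term, keeps the comparison itself a plain code point comparison.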

There are valid reasons for a programmer to deal with code points. Programming languages that support them may (or may not) support correct encodings (UTF-8 vs. UTF-16 vs. ISO-8859-1) and may (or may not) correctly produce surrogates for UTF-16. Text-wise, users should not be concerned with code points, although it would help them distinguish homoglyphs.
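For reference, this is roughly what "producing surrogates for UTF-16" means. The sketch below computes the surrogate pair by hand and cross-checks it against Python's built-in `utf-16-be` codec; `to_surrogate_pair` is an illustrative name, not a standard function.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 high/low surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F4B2)        # HEAVY DOLLAR SIGN
encoded = "\U0001f4b2".encode("utf-16-be")    # what the codec produces

print(hex(high), hex(low))  # 0xd83d 0xdcb2
print(encoded.hex())        # d83ddcb2
```

One code point becomes two UTF-16 code units here, which is why a code point type and a UTF-16 code unit type are not interchangeable.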

huangapple
  • Posted on September 9, 2022, 22:03:16
  • When reposting, please keep the link to this article: https://go.coder-hub.com/73663349.html