Do abstract characters in the translation character set differ from Unicode scalar values?

Question

Consider the following sequence of bytes in hexadecimal representation (ASCII interpretations, if any, in the second column as a reading aid):

0x73 s
0x74 t
0x61 a
0x74 t
0x69 i
0x63 c
0x5f _
0x61 a
0x73 s
0x73 s
0x65 e
0x72 r
0x74 t
0x28 (
0x55 U
0x27 '
0xe2
0x84
0xab
0x27 '
0x3d =
0x3d =
0x55 U
0x27 '
0xc3
0x85
0x27 '
0x29 )
0x3b ;

Decoded as UTF-8, this byte sequence reads:

static_assert(U'Å'==U'Å');

Note that the character on the left-hand side is the Unicode scalar value

0x212B ANGSTROM SIGN

and the character on the right-hand side, Å, is the Unicode scalar value

0x00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
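The two scalar values can be checked by decoding the bytes by hand. The following is a minimal sketch, not a full UTF-8 decoder: the names `decode_utf8`, `left_bytes` and `right_bytes` are made up for this illustration, and the function assumes well-formed input of at most three bytes.

```cpp
// Decode a single UTF-8 sequence of 1 to 3 bytes starting at p[0].
// Simplified for illustration: continuation bytes are not validated,
// overlong forms and surrogates are not rejected, and 4-byte
// sequences are not handled.
constexpr char32_t decode_utf8(const unsigned char* p) {
    if (p[0] < 0x80) {
        return p[0];                                  // 0xxxxxxx
    }
    if ((p[0] & 0xE0) == 0xC0) {                      // 110xxxxx 10xxxxxx
        return (char32_t(p[0] & 0x1F) << 6) | char32_t(p[1] & 0x3F);
    }
    // Assumed to be a 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
    return (char32_t(p[0] & 0x0F) << 12) |
           (char32_t(p[1] & 0x3F) << 6) |
           char32_t(p[2] & 0x3F);
}

// The bytes between the quotes on each side of the comparison.
constexpr unsigned char left_bytes[]  = {0xE2, 0x84, 0xAB};
constexpr unsigned char right_bytes[] = {0xC3, 0x85};

static_assert(decode_utf8(left_bytes) == 0x212B, "left side is U+212B");
static_assert(decode_utf8(right_bytes) == 0x00C5, "right side is U+00C5");
```

Both checks are done at compile time, so the snippet only needs to compile to confirm the two values.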

Is the assertion supposed to fail in C++23 when the byte sequence is interpreted as a source in the mandatorily-supported UTF-8 encoding?


In translation phase 1, after decoding the UTF-8 sequence to a Unicode scalar value sequence, these scalar values are supposed to be mapped to elements of the translation character set to form a sequence of translation character set elements, see [lex.phases]/1.1. According to [lex.charset]/1.1 the elements of the translation character set are, with the exception of unassigned scalar values, the abstract characters which have an assigned Unicode code point.

The closest definition I could find for "abstract character" is in the Unicode Standard. However, according to definition D11 in its Section 3.4, an abstract character can be assigned multiple code points, and it gives the Angstrom character as an example. (EDIT: Carefully reading again, it doesn't say "assigned", just "correspond to".)

If this is the definition of abstract character meant in the C++ standard draft, isn't there then supposed to be only one element in the translation character set which is equivalent to the single abstract character represented by both the code points 0x212B and 0x00C5? If so, shouldn't then the value of both character literals be the same since the value is derived from the translation character set element which doesn't retain any information about the original scalar value?

This does not seem intended to me. Does Unicode even provide complete information on which code points refer to the same abstract character? But then, what exactly is meant by abstract character in the standard draft?

Answer 1

Score: 5

This question is really about what "abstract character" means. That term is defined by the Unicode Standard.

You cited a passage saying that an abstract character may correspond to multiple code points, or even to sequences of code points.

The problem is that the rest of the standard doesn't seem to agree.

If you look at the Unicode code charts (also part of the Unicode Standard), nothing in the entries for U+212B or U+00C5 says that they encode the same abstract character. The entry for U+212B says:

> • preferred representation is 00C5 Å
> ≡ 00C5 Å latin capital letter a with ring above

However, the ≡ symbol is defined to mean "canonical decomposition mapping". If you look that term up in the glossary, you'll find that it says nothing about what the abstract character is.
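To make the distinction concrete: a canonical decomposition relates code points, and two code points can be canonically equivalent while still being different values. The sketch below is a toy, not a normalization implementation. It hard-codes three singleton decompositions from the Unicode Character Database, and the names `SingletonMapping`, `kSingletons`, `canonical_singleton` and `canonically_equivalent` are invented for this example.

```cpp
struct SingletonMapping {
    char32_t from;  // code point that has a singleton canonical decomposition
    char32_t to;    // the code point it decomposes to
};

// A tiny excerpt only; the real data lives in the Unicode Character Database.
constexpr SingletonMapping kSingletons[] = {
    {0x2126, 0x03A9},  // OHM SIGN      -> GREEK CAPITAL LETTER OMEGA
    {0x212A, 0x004B},  // KELVIN SIGN   -> LATIN CAPITAL LETTER K
    {0x212B, 0x00C5},  // ANGSTROM SIGN -> LATIN CAPITAL LETTER A WITH RING ABOVE
};

// Replace a code point by its singleton decomposition, if the table has one.
constexpr char32_t canonical_singleton(char32_t c) {
    for (const auto& m : kSingletons) {
        if (m.from == c) {
            return m.to;
        }
    }
    return c;
}

// Toy equivalence test restricted to the singleton mappings above.
constexpr bool canonically_equivalent(char32_t a, char32_t b) {
    return canonical_singleton(a) == canonical_singleton(b);
}

// Equivalent under the mapping, yet not the same code point.
static_assert(canonically_equivalent(0x212B, 0x00C5), "canonically equivalent");
static_assert(char32_t(0x212B) != char32_t(0x00C5), "distinct code points");
```

Nothing in the C++ translation phases applies such a mapping, so this relation is visible only to code that chooses to compute it.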

In fact, if you look around the glossary, you may stumble upon the definition of "character name":

> Character Name. A unique string used to identify each abstract character encoded in the standard. (See definition D4 in Section 3.3, Semantics.)

So, every "abstract character" "encoded in the standard" has a "unique string" associated with it.

Therefore, if "U+212B" and "U+00C5" have different "character name" properties, they must be different abstract characters.

And if you look them up in the Unicode Character Database, they do in fact have different character names. Ergo, they are different abstract characters, which have different Unicode code points and therefore do not compare equal.
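Under this reading, the two literals simply have different values. A small sketch, using universal-character-names instead of the raw characters so that the result does not depend on how the file is saved or normalized by an editor (the variable names are only illustrative):

```cpp
// A UTF-32 character literal has the value of the character's code point.
constexpr char32_t angstrom_sign = U'\u212B';  // ANGSTROM SIGN
constexpr char32_t a_with_ring   = U'\u00C5';  // LATIN CAPITAL LETTER A WITH RING ABOVE

static_assert(angstrom_sign == 0x212B, "value is the code point U+212B");
static_assert(a_with_ring == 0x00C5, "value is the code point U+00C5");
static_assert(angstrom_sign != a_with_ring, "the two literals do not compare equal");
```

So the assertion from the question, written with `==`, would be expected to fail to compile.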

This contradicts the example given in the quoted part of the Unicode standard. So the problem is that the Unicode standard itself is inconsistent. The database that defines the mapping is inconsistent with part of the text.

It may well be that this is the only place in the standard where it claims that multiple code points map to the same abstract character.

That being said, I would say that the C++ standard should use the term "encoded character" rather than "abstract character". The former clearly and unequivocally refers to a specific code point assigned to a character. Note that even the definition of "encoded character" does not recognize the possibility of multiple code points mapping to an abstract character: "between an abstract character and a code point." Those are both singular.

huangapple
  • Posted on June 9, 2023, 06:56:49
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76436196.html