Different results using codepoint() with input arguments with \dot
Question
I am trying to see whether the \dot operator can be detected from a symbol in Julia. Here is what I have tried.
The following two blocks return different results:
julia> [codepoint(i) for i in string(:ẋ)]
1-element Vector{UInt32}:
0x00001e8b
julia> [codepoint(i) for i in "ẋ"]
2-element Vector{UInt32}:
0x00000078
0x00000307
Ideally I would start from a symbol, not a string, so I need to use the first method. But that does not return 0x307, the Unicode code point of \dot, which makes \dot hard to detect.
So what is the mechanism behind the difference? Thank you.
Answer 1
Score: 5
Both results are equivalent.
Humans are complex, and so are languages, so Unicode needs complex rules.
In your case you have two representations:
- U+1E8B (LATIN SMALL LETTER X WITH DOT ABOVE)
- U+0078 (LATIN SMALL LETTER X) + U+0307 (COMBINING DOT ABOVE)
Unicode considers both equivalent. Note: when comparing strings, it is a good idea to normalize them first. Unfortunately, there are two main normalization forms:
- NFD: Normalization Form Canonical Decomposition, which is the second case. Whenever possible, characters are decomposed into base + modifier. This normalization is preferred by Apple, and it was the original idea in Unicode.
- NFC: Normalization Form Canonical Composition. If there is a way to combine characters, it is done. When there are several modifiers, rules define how they are combined (that is, which takes precedence). This form is preferred by most other operating systems.
- NFKD and NFKC: the K variants (compatibility instead of canonical). These are trickier, because characters can be compatibility-equivalent for various reasons, so they are usually used for searching text rather than for display.
See https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
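As a rough illustration of the two canonical forms, the sketch below uses `Unicode.normalize` from Julia's `Unicode` standard library. The variable names `composed` and `decomposed` are made up for this example.

```julia
using Unicode

# The two representations from the question, written with escapes
# so that the editor cannot silently change them.
composed   = "\u1e8b"    # LATIN SMALL LETTER X WITH DOT ABOVE
decomposed = "x\u0307"   # LATIN SMALL LETTER X + COMBINING DOT ABOVE

# Plain string comparison works on code points, so the two differ.
composed == decomposed                             # false

# After normalizing to a common form, they compare equal.
Unicode.normalize(composed, :NFD) == decomposed    # true
Unicode.normalize(decomposed, :NFC) == composed    # true

# Number of code points in each representation.
length(composed)     # 1
length(decomposed)   # 2
```

Which form to normalize to matters less than normalizing both sides to the same one before comparing.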
The display engine (layout engine, text shaping, glyph display, font metadata) will probably render the same symbol for both. Each font has its own preference for which normalization it expects, but the engine will then try to find a combined glyph.
I think in your case you may have two different variants in the text file: one using two characters and one using a single character. This often happens when copying characters, because some editors prefer one normalization over the other.
In your case, I think you should normalize the string; see e.g. Unicode.normalize in https://docs.julialang.org/en/v1/stdlib/Unicode/
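Applied to the original goal of detecting \dot from a symbol, a minimal sketch could decompose the symbol's name to NFD and look for U+0307. The helper name `has_dot_above` is hypothetical, not part of Julia. As far as I know, Julia's parser normalizes identifiers to NFC, which would explain why `:ẋ` comes back as the single code point U+1E8B while the string literal keeps the two code points that were typed.

```julia
using Unicode

# Hypothetical helper: true if the symbol's name contains
# COMBINING DOT ABOVE (U+0307) once it is decomposed to NFD.
has_dot_above(sym::Symbol) = '\u0307' in Unicode.normalize(string(sym), :NFD)

# Symbol(...) is used here so both representations can be built explicitly.
has_dot_above(Symbol("\u1e8b"))    # composed form: true
has_dot_above(Symbol("x\u0307"))   # decomposed form: true
has_dot_above(:x)                  # no dot: false
```

Because the check runs on the NFD form, it does not matter which representation the symbol's name happens to use.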
And since these are Latin characters, this is the easy part of Unicode (apart from Latin being one of the few scripts with upper and lower case).