How to combine two code points to get one?
Question
I know that the Unicode code point for Á is U+00C1. I have read on many forums and in many articles that I can also produce an Á by combining the characters ´ (Unicode: U+00B4) and A (Unicode: U+0041).
My question is simple: how do I do it? I tried it in Go, but an answer in Python (or some other programming language) is perfectly fine.
This is what I tried.
A in binary is: 01000001
´ in binary is: 10110100
Together they take 15 bits, so I need the 3-byte UTF-8 format (1110xxxx 10xxxxxx 10xxxxxx).
Filling the bits of A and ´ (A first) into the x positions gives: 11100100 10000110 10110100.
Then I converted the resulting three bytes back into hexadecimal: E4 86 B4.
However, when I wrote that in code, I got a completely different character. In other words, my solution does not work as I expected.
package main

import (
	"fmt"
)

func main() {
	r := "\xE4\x86\xB4"
	fmt.Println(r) // prints 䆴 instead of Á
}
Answer 1
Score: 3
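It looks like the ´ (U+00B4) character you provided is not actually a combining character as Unicode defines it. It is a standalone (spacing) acute accent, so placing it after A just gives two separate characters: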
>>> "A\u00b4"
'A´'
If we use ◌́ (U+0301) instead, then we can just place it directly after a character like A and get the expected output:
>>> "A\u0301"
'Á'
Unless I'm misunderstanding what you mean, it doesn't look like any binary manipulation or trickery is necessary here.
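For readers working in Go, as in the question, the same idea can be sketched with the standard library only. The helper name codePoints is made up for this illustration. It lists the code points in a string, which also suggests why the hand-built bytes E4 86 B4 came out as 䆴: UTF-8 encodes one code point at a time, and those three bytes are the encoding of the single code point U+41B4, not of A followed by an accent.

```go
package main

import (
	"fmt"
	"strings"
)

// codePoints is a hypothetical helper for this sketch: it returns the
// code points of s in U+XXXX notation, separated by spaces.
func codePoints(s string) string {
	parts := make([]string, 0, len(s))
	for _, r := range s { // ranging over a string decodes UTF-8 into runes
		parts = append(parts, fmt.Sprintf("%U", r))
	}
	return strings.Join(parts, " ")
}

func main() {
	// A followed by the combining acute accent: two code points, three bytes.
	fmt.Println(codePoints("A\u0301")) // U+0041 U+0301
	fmt.Printf("% X\n", "A\u0301")     // 41 CC 81

	// The bytes built by hand in the question decode to one code point.
	fmt.Println(codePoints("\xE4\x86\xB4")) // U+41B4

	// The precomposed character is a single code point encoded in two bytes.
	fmt.Printf("% X\n", "\u00c1") // C3 81
}
```

Printing "A\u0301" directly should render as Á in a terminal or editor that supports combining characters, even though the string holds two code points.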
Answer 2
Score: 1
As StardustGogeta explains in their answer, the correct combining Unicode character for an acute accent is U+0301 (Combining Acute Accent).
But in Go, a string consisting of the single character U+00C1 (Latin Capital Letter A with Acute) is not equal to a string consisting of U+0041 (Latin Capital Letter A) followed by U+0301 (Combining Acute Accent).
If you want to compare such strings, you need to normalise both to the same normalisation form. See the blog post Text normalization in Go for more details.
The following code snippet shows how to do that:
package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	combined := "\u00c1"   // precomposed: a single code point
	combining := "A\u0301" // decomposed: A followed by the combining accent
	fmt.Printf("combined = %s, combining = %s\n", combined, combining)
	fmt.Printf("combined == combining: %t\n", combined == combining)
	// Normalise the decomposed string to NFC before comparing.
	combiningNormalised := string(norm.NFC.Bytes([]byte(combining)))
	fmt.Printf("combined == combiningNormalised: %t\n", combined == combiningNormalised)
}
Output:
combined = Á, combining = Á
combined == combining: false
combined == combiningNormalised: true