How can I get the Unicode value of a character in Go?
Question
I am trying to get the Unicode value of a character in a Go string as an integer value.
I do this:
    value = strconv.Itoa(int(([]byte(char))[0]))
where char is a string containing a single character.
That works in many cases, but not for umlauts such as ä, ö, ü, Ä, Ö, Ü.
For example, Ä yields 65, which is the same as for A.
How can I get the correct value?
Supplement: I had two problems. The first was solved by any of the answers below. The second was a bit trickier: my input was not normalized UTF-8, e.g. umlauts were represented by two code points instead of one. As ANisus said, the solution is found in the package golang.org/x/text/unicode/norm. The line above is now two lines:
    rune, _ := utf8.DecodeRune(norm.NFC.Bytes([]byte(char)))
    value = strconv.Itoa(int(rune))
Any hints for making this shorter are welcome ...
Answer 1
Score: 11
Strings are UTF-8 encoded, so to decode a character from a string and get its rune (Unicode code point), you can use the unicode/utf8 package.
Example:
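    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        str := "AÅÄÖ"
        // Decode one rune at a time until the string is consumed.
        for len(str) > 0 {
            // size is the number of bytes the rune occupies in UTF-8.
            r, size := utf8.DecodeRuneInString(str)
            fmt.Printf("%d %v\n", r, size)
            str = str[size:]
        }
    }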
Result:
    65 1
    197 2
    196 2
    214 2
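The loop above assumes the string holds valid UTF-8. As a minimal sketch (the helper name firstRune is hypothetical and not part of the original answer), the first code point can be extracted with a check for empty or malformed input, relying on the documented behaviour that utf8.DecodeRuneInString returns utf8.RuneError with a size of 0 or 1 in those cases:

```go
package main

import (
	"fmt"
	"strconv"
	"unicode/utf8"
)

// firstRune is a hypothetical helper: it returns the first code point of s,
// and false when s is empty or begins with invalid UTF-8.
func firstRune(s string) (rune, bool) {
	r, size := utf8.DecodeRuneInString(s)
	// RuneError with size 0 means empty input, size 1 means a bad encoding.
	// A genuine U+FFFD in the input decodes with size 3 and is accepted.
	if r == utf8.RuneError && size <= 1 {
		return 0, false
	}
	return r, true
}

func main() {
	// "\u00c4" is the precomposed Ä; "\xff" is not valid UTF-8.
	for _, s := range []string{"A", "\u00c4", "", "\xff"} {
		if r, ok := firstRune(s); ok {
			fmt.Println(strconv.Itoa(int(r)))
		} else {
			fmt.Println("no valid rune")
		}
	}
}
```

Whether invalid input should be rejected or mapped to U+FFFD depends on the caller; the sketch rejects it so that a bad byte is never mistaken for a real code point.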
Edit: (To clarify Michael's supplement)
A character such as Ä may be created using different Unicode code points:
Precomposed: Ä (U+00C4)
Using combining diaeresis: A (U+0041) + ¨ (U+0308)
In order to get the precomposed form, one can use the normalization package golang.org/x/text/unicode/norm. The NFC form (Canonical Decomposition, followed by Canonical Composition) will turn U+0041 + U+0308 into U+00C4:
    c := "\u0041\u0308"
    r, _ := utf8.DecodeRune(norm.NFC.Bytes([]byte(c)))
    fmt.Printf("%+q", r) // '\u00c4'
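To see why the decomposed form produces 65 in the question, the following sketch compares both representations using only the standard library. The helper hasCombiningMark is a hypothetical name; it assumes that checking for Unicode category Mn (nonspacing marks, which includes U+0308) is a good enough signal that the input may need NFC normalization. It only detects such input and does not normalize it.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// hasCombiningMark is a hypothetical helper: it reports whether s contains
// a nonspacing combining mark (Unicode category Mn), such as U+0308.
func hasCombiningMark(s string) bool {
	for _, r := range s {
		if unicode.Is(unicode.Mn, r) {
			return true
		}
	}
	return false
}

func main() {
	precomposed := "\u00c4"      // Ä as a single code point
	decomposed := "\u0041\u0308" // A followed by combining diaeresis

	for _, s := range []string{precomposed, decomposed} {
		r, _ := utf8.DecodeRuneInString(s)
		fmt.Printf("%+q: first rune %d, %d rune(s), combining mark: %v\n",
			s, r, utf8.RuneCountInString(s), hasCombiningMark(s))
	}
}
```

For the decomposed string the first rune is the plain A (65), which matches the symptom described in the question.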
Answer 2
Score: 8
The "character" type in Go is the rune, which is an alias for int32; see also Rune literals. A rune is an integer value identifying a Unicode code point.
In Go, strings are represented and stored as the UTF-8 encoded byte sequence of the text. The range form of the for loop iterates over the runes of the text:
    s := "äöüÄÖÜ世界"
    for _, r := range s {
        fmt.Printf("%c - %d\n", r, r)
    }
Output:
    ä - 228
    ö - 246
    ü - 252
    Ä - 196
    Ö - 214
    Ü - 220
    世 - 19990
    界 - 30028
Try it on the Go Playground.
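One detail the range loop hides is the difference between byte positions and rune positions. The sketch below (codePoints is a hypothetical helper name) illustrates that the index produced by range is a byte offset, that indexing a string yields a single byte, and that converting to []rune gives direct access to the code points:

```go
package main

import "fmt"

// codePoints is a hypothetical helper: it collects the code points of s
// by converting the string to a rune slice.
func codePoints(s string) []rune {
	return []rune(s)
}

func main() {
	s := "\u00e4\u00f6\u4e16" // "äö世", written with escapes to fix the encoding

	// The index from range is the byte offset where each rune starts,
	// so it advances by 2 for ä and ö and by 3 for 世.
	for i, r := range s {
		fmt.Printf("offset %d: %c = %d\n", i, r, r)
	}

	// Indexing yields one byte: here the first byte of ä's UTF-8 encoding.
	fmt.Println(s[0])

	// len counts bytes; the rune slice counts code points.
	fmt.Println(len(s), len(codePoints(s)))
}
```

Converting to []rune allocates a new slice, so for a single pass over the text the range loop is usually the lighter choice.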
Read this blog article if you want to know more about the topic:
Answer 3
Score: 6
You can use the unicode/utf8 package:
    // Named r rather than rune to avoid shadowing the built-in type.
    r, _ := utf8.DecodeRuneInString("Ä")
    fmt.Println(r)
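Since the question asks for the value as a string, here is a minimal sketch that formats the decoded code point both in decimal and in the conventional U+XXXX notation using fmt's %U verb. The helper name describe is hypothetical, and the sketch assumes the input is non-empty, valid UTF-8; otherwise it would report U+FFFD.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper: it formats the first code point of s
// in decimal and in U+XXXX notation.
func describe(s string) string {
	r, _ := utf8.DecodeRuneInString(s)
	return fmt.Sprintf("%d %U", r, r)
}

func main() {
	fmt.Println(describe("\u00c4")) // precomposed Ä: 196 U+00C4
	fmt.Println(describe("A"))      // 65 U+0041
}
```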