RuneCountInString returns an unexpected count when a character contains U+FE0F
Question
t := "🈶️"
fmt.Println(utf8.RuneCountInString(t))
I think printing a count of 1 would be better. Why does it return 2?
Answer 1
Score: 2
The character 🈶️ is represented by 2 code points: U+1F236 followed by the variation selector U+FE0F, which requests emoji presentation. "Code point" is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. utf8.RuneCountInString returns the number of runes, which is 2 here, so it works as expected.
If you want to count user-perceived characters (grapheme clusters) instead, try the package github.com/rivo/uniseg.
The demo below should explain it better:
package main

import (
	"fmt"
	"unicode/utf8"

	"github.com/rivo/uniseg"
)

func main() {
	s1 := "🈶️"                             // UTF-8 input text
	s2 := "\U0001f236\ufe0f"             // <== the explicit Unicode code points
	s3 := "\xf0\x9f\x88\xb6\xef\xb8\x8f" // the explicit UTF-8 bytes

	fmt.Println("s1:", s1)
	fmt.Println("s1 == s2:", s1 == s2)
	fmt.Println("s2 == s3:", s2 == s3)
	fmt.Println("len(s1):", len(s1), "bytes")

	fmt.Println("runes:")
	for pos, r := range s1 {
		fmt.Printf(" %d: %X\n", pos, r)
	}

	fmt.Println("utf8.RuneCount:", utf8.RuneCount([]byte(s1)))
	fmt.Println("utf8.RuneCountInString:", utf8.RuneCountInString(s1))

	// GraphemeClusterCount returns the number of user-perceived characters
	// (grapheme clusters) for the given string.
	fmt.Println("uniseg.GraphemeClusterCount:", uniseg.GraphemeClusterCount(s1))
}
Output:
s1: 🈶️
s1 == s2: true
s2 == s3: true
len(s1): 7 bytes
runes:
0: 1F236
4: FE0F
utf8.RuneCount: 2
utf8.RuneCountInString: 2
uniseg.GraphemeClusterCount: 1
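If pulling in a third-party package is not an option, a rough standard-library-only approximation is to count runes while skipping variation selectors (U+FE00 to U+FE0F). This is only a sketch: the helper names below are hypothetical, and it is not grapheme cluster segmentation. It gives 1 for the string in the question, but it still counts combining marks, ZWJ emoji sequences, flags, and so on as separate runes, so uniseg remains the safer choice for general text.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isVariationSelector reports whether r lies in the Variation Selectors
// block U+FE00..U+FE0F, which includes U+FE0F (emoji presentation).
func isVariationSelector(r rune) bool {
	return r >= 0xFE00 && r <= 0xFE0F
}

// countIgnoringVariationSelectors is a hypothetical helper that counts the
// runes in s but skips variation selectors. It does NOT count grapheme
// clusters: other combining code points are still counted individually.
func countIgnoringVariationSelectors(s string) int {
	n := 0
	for _, r := range s {
		if !isVariationSelector(r) {
			n++
		}
	}
	return n
}

func main() {
	s := "\U0001f236\ufe0f" // the same two code points as in the demo above

	fmt.Println("utf8.RuneCountInString:", utf8.RuneCountInString(s))
	fmt.Println("ignoring variation selectors:", countIgnoringVariationSelectors(s))
}
```

For the string from the question this prints 2 for utf8.RuneCountInString and 1 for the helper. A string such as "e\u0301" (e plus a combining acute accent) still yields 2, which shows where this shortcut stops being a substitute for proper grapheme segmentation.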
References:
- Rob Pike's excellent article "Strings, bytes, runes and characters in Go".
- The "String literals" section in The Go Programming Language Specification.
- Henrique Vicente's blog post "UTF-8 strings with Go: len(s) isn't enough".