RuneCountInString returns an unexpected count when a character contains U+FE0F
Question
t := "🈶️"
fmt.Println(utf8.RuneCountInString(t))
I think printing a count of 1 would be better. Why does it return 2?
Answer 1
Score: 2
The character 🈶️ is represented by two code points: U+1F236 and U+FE0F (variation selector-16, which requests emoji presentation). "Code point" is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. utf8.RuneCountInString returns the number of runes, which is 2 here, so it works as expected.
If you want to count user-perceived characters (grapheme clusters) instead of runes, try the package github.com/rivo/uniseg.
The demo below should explain it better:
package main

import (
	"fmt"
	"unicode/utf8"

	"github.com/rivo/uniseg"
)

func main() {
	s1 := "🈶️"                           // UTF-8 input text
	s2 := "\U0001f236\ufe0f"             // <== the explicit Unicode code points
	s3 := "\xf0\x9f\x88\xb6\xef\xb8\x8f" // the explicit UTF-8 bytes
	fmt.Println("s1:", s1)
	fmt.Println("s1 == s2:", s1 == s2)
	fmt.Println("s2 == s3:", s2 == s3)
	fmt.Println("len(s1):", len(s1), "bytes")
	fmt.Println("runes:")
	for pos, r := range s1 {
		fmt.Printf("  %d: %X\n", pos, r)
	}
	fmt.Println("utf8.RuneCount:", utf8.RuneCount([]byte(s1)))
	fmt.Println("utf8.RuneCountInString:", utf8.RuneCountInString(s1))
	// GraphemeClusterCount returns the number of user-perceived characters
	// (grapheme clusters) for the given string.
	fmt.Println("uniseg.GraphemeClusterCount:", uniseg.GraphemeClusterCount(s1))
}
Output:
s1: 🈶️
s1 == s2: true
s2 == s3: true
len(s1): 7 bytes
runes:
  0: 1F236
  4: FE0F
utf8.RuneCount: 2
utf8.RuneCountInString: 2
uniseg.GraphemeClusterCount: 1
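
If you want to see where the two runes come from without pulling in a third-party package, here is a minimal sketch that uses only the standard library. It walks the string with utf8.DecodeRuneInString and records how many bytes each rune takes. The helper name runeWidths is made up for this illustration; it is not part of any package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeWidths decodes s one rune at a time and returns the number of
// UTF-8 bytes each rune occupies. The name is illustrative only.
func runeWidths(s string) []int {
	var widths []int
	for len(s) > 0 {
		// DecodeRuneInString returns the first rune and its width in bytes.
		_, size := utf8.DecodeRuneInString(s)
		widths = append(widths, size)
		s = s[size:]
	}
	return widths
}

func main() {
	// U+1F236 followed by U+FE0F (variation selector-16).
	s := "\U0001f236\ufe0f"
	widths := runeWidths(s)
	fmt.Println("bytes per rune:", widths)
	fmt.Println("rune count:", len(widths), "==", utf8.RuneCountInString(s))
}
```

The 7 bytes split into a 4-byte rune and a 3-byte rune, which is why the rune count is 2 even though the text renders as a single symbol.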
References:
- Rob Pike's excellent article "Strings, bytes, runes and characters in Go".
- The "String literals" section in The Go Programming Language Specification.
- Henrique Vicente's blog post "UTF-8 strings with Go: len(s) isn't enough".