RuneCountInString returns an unexpected count when a character contains U+FE0F

Question

t := "🈶️"
fmt.Println(utf8.RuneCountInString(t))

I expected the count to be 1. Why does it return 2?

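Answer 1

Score: 2

The character 🈶️ is represented by 2 code points: U+1F236 followed by U+FE0F (Variation Selector-16, which requests emoji presentation). "Code point" is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. utf8.RuneCountInString counts runes, so it returns 2 and works as intended.

If you want to count user-perceived characters (grapheme clusters) instead, try the package github.com/rivo/uniseg.

The demo below should explain it better: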

package main

import (
	"fmt"
	"unicode/utf8"

	"github.com/rivo/uniseg"
)

func main() {
	s1 := "🈶️"                           // UTF-8 input text (the literal ends with an invisible U+FE0F)
	s2 := "\U0001f236\ufe0f"             // <== the explicit Unicode code points
	s3 := "\xf0\x9f\x88\xb6\xef\xb8\x8f" // the explicit UTF-8 bytes
	fmt.Println("s1:", s1)
	fmt.Println("s1 == s2:", s1 == s2)
	fmt.Println("s2 == s3:", s2 == s3)
	fmt.Println("len(s1):", len(s1), "bytes")
	fmt.Println("runes:")
	for pos, r := range s1 {
		fmt.Printf("  %d: %X\n", pos, r)
	}
	fmt.Println("utf8.RuneCount:", utf8.RuneCount([]byte(s1)))
	fmt.Println("utf8.RuneCountInString:", utf8.RuneCountInString(s1))

	// GraphemeClusterCount returns the number of user-perceived characters
	// (grapheme clusters) for the given string.
	fmt.Println("uniseg.GraphemeClusterCount:", uniseg.GraphemeClusterCount(s1))
}

Output:

s1: 🈶️
s1 == s2: true
s2 == s3: true
len(s1): 7 bytes
runes:
  0: 1F236
  4: FE0F
utf8.RuneCount: 2
utf8.RuneCountInString: 2
uniseg.GraphemeClusterCount: 1

huangapple
  • Published on March 24, 2023, 13:12:49
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75830226.html