How to get the number of characters in a string

Question

How can I get the number of characters of a string in Go?

For example, for the string "hello" the method should return 5. I noticed that len(str) returns the number of bytes, not the number of characters, so len("£") returns 2 instead of 1 because £ is encoded as two bytes in UTF-8.
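
A minimal sketch of what I am seeing (hexBytes is just a helper made up for this illustration, not a standard function):

```go
package main

import "fmt"

// hexBytes is a hypothetical helper that shows the raw UTF-8 bytes of s.
func hexBytes(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	for _, s := range []string{"hello", "£"} {
		// len counts bytes, so a multi-byte character inflates the result.
		fmt.Printf("%q: len=%d, bytes=[%s]\n", s, len(s), hexBytes(s))
	}
}
```

The pound sign shows up as the two bytes c2 a3, which is why len reports 2 for it.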

Answer 1

Score: 237

You can try RuneCountInString from the utf8 package.

> returns the number of runes in p

As this script illustrates, the byte length of "世界" (Chinese for "World") is 6, but its rune count is 2:

package main
    
import "fmt"
import "unicode/utf8"
    
func main() {
	fmt.Println("Hello, 世界", len("世界"), utf8.RuneCountInString("世界"))
}

Phrozen adds in the comments:

Actually you can do len() over runes by just type casting.
len([]rune("世界")) will print 2. At least in Go 1.3.


And with CL 108985 (May 2018, for Go 1.11), len([]rune(string)) is now optimized. (Fixes issue 24923)

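The compiler now detects the len([]rune(string)) pattern automatically and replaces it with a call to a rune-counting runtime function, which is equivalent to counting the iterations of for range s and avoids building the intermediate []rune.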

> Adds a new runtime function to count runes in a string.
> Modifies the compiler to detect the pattern len([]rune(string))
> and replaces it with the new rune counting runtime function.
>
> name                                old time/op  new time/op  delta
> RuneCount/lenruneslice/ASCII        27.8ns ± 2%  14.5ns ± 3%  -47.70%
> RuneCount/lenruneslice/Japanese     126ns ± 2%   60ns ± 2%    -52.03%
> RuneCount/lenruneslice/MixedLength  104ns ± 2%   50ns ± 1%    -51.71%
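
As a rough sketch of what such a rune-counting function amounts to (the name countRunes is made up here, and this is not the actual runtime source), a range loop visits each code point once without allocating a []rune:

```go
package main

import "fmt"

// countRunes is an illustrative stand-in, not the real runtime function.
// Ranging over a string decodes one UTF-8 sequence per iteration.
func countRunes(s string) int {
	n := 0
	for range s {
		n++
	}
	return n
}

func main() {
	s := "Hello, 世界"
	// Both expressions count code points; the conversion form is the
	// pattern the compiler is described as recognizing.
	fmt.Println(countRunes(s), len([]rune(s)))
}
```

Both values agree, so the conversion form can be kept for readability where it is clearer.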


Stefan Steiger points to the blog post "Text normalization in Go":

> ## What is a character?

> As was mentioned in the strings blog post, characters can span multiple runes.
> For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character.
>
> The definition of a character may vary depending on the application.
> For normalization we will define it as:
>
> - a sequence of runes that starts with a starter (a rune that does not modify or combine backwards with any other rune),
> - followed by a possibly empty sequence of non-starters, that is, runes that do (typically accents).
>
> The normalization algorithm processes one character at a time.

Using that package and its Iter type, the actual number of "characters" would be:

package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	var ia norm.Iter
	ia.InitString(norm.NFKD, "école")
	nc := 0
	for !ia.Done() {
		nc++
		ia.Next()
	}
	fmt.Printf("Number of chars: %d\n", nc)
}

This uses the Unicode normalization form NFKD ("Compatibility Decomposition").
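
To see why a plain rune count can differ from the character count for the 'é' example quoted above, here is a small standard-library-only sketch comparing the precomposed and decomposed spellings (the helper sizes and the variable names are made up for illustration):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes is a hypothetical helper returning the byte and rune counts of s.
func sizes(s string) (bytes, runes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	composed := "\u00e9"    // é as a single precomposed code point
	decomposed := "e\u0301" // e followed by a combining acute accent

	fmt.Println(sizes(composed))
	fmt.Println(sizes(decomposed))
	// The two spellings render alike but are different strings.
	fmt.Println(composed == decomposed)
}
```

The same visible character yields a different rune count depending on how it was encoded, which is the gap the normalization iterator is meant to close.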


Oliver's answer points to UNICODE TEXT SEGMENTATION as the only way to reliably determine default boundaries between certain significant text elements: user-perceived characters, words, and sentences.

For that, you need an external library like rivo/uniseg, which does Unicode Text Segmentation.

That will actually count "grapheme clusters", where multiple code points may be combined into one user-perceived character.

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	gr := uniseg.NewGraphemes("👍🏼!")
	// Each call to Next advances to the next grapheme cluster.
	for gr.Next() {
		fmt.Printf("%x ", gr.Runes())
	}
	// Output: [1f44d 1f3fc] [21]
}

Two graphemes, even though there are three runes (Unicode code points).

You can see other examples in "How to manipulate strings in GO to reverse them?"

👩🏾‍🦰 alone is one grapheme cluster but, according to a Unicode-to-code-points converter, it consists of 4 runes.
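
That breakdown can be reproduced with the standard library alone. The sketch below uses a hypothetical helper, codePoints, and spells the emoji with escapes on the assumption that it is woman + medium-dark skin tone + zero-width joiner + red hair:

```go
package main

import "fmt"

// codePoints is a hypothetical helper listing every rune of s in
// U+XXXX notation.
func codePoints(s string) []string {
	var out []string
	for _, r := range s {
		out = append(out, fmt.Sprintf("%U", r))
	}
	return out
}

func main() {
	// Assumed spelling of the emoji discussed above.
	s := "\U0001F469\U0001F3FE\u200D\U0001F9B0"
	cps := codePoints(s)
	fmt.Println(len(cps), cps)
}
```

A rune count sees four separate code points here, while a grapheme-aware library treats the whole sequence as one user-perceived character.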

Answer 2

Score: 56

You can get the rune count without importing any package by converting the string to []rune and taking its length, len([]rune(YOUR_STRING)):

package main

import "fmt"

func main() {
	russian := "Спутник и погром"
	english := "Sputnik & pogrom"

	fmt.Println("count of bytes:",
		len(russian),
		len(english))

	fmt.Println("count of runes:",
		len([]rune(russian)),
		len([]rune(english)))
}

> count of bytes: 30 16

> count of runes: 16 16

Answer 3

Score: 9

I should point out that none of the answers provided so far give you the number of characters as you would expect, especially when you're dealing with emojis (but also some languages like Thai, Korean, or Arabic). VonC's suggestions will output the following:

fmt.Println(utf8.RuneCountInString("🏳️‍🌈🇩🇪")) // Outputs "6".
fmt.Println(len([]rune("🏳️‍🌈🇩🇪"))) // Outputs "6".

That's because these methods only count Unicode code points. There are many characters which can be composed of multiple code points.

The same goes for the normalization package:

var ia norm.Iter
ia.InitString(norm.NFKD, "🏳️‍🌈🇩🇪")
nc := 0
for !ia.Done() {
	nc = nc + 1
	ia.Next()
}
fmt.Println(nc) // Outputs "6".

Normalization is not really the same as counting characters, and many characters cannot be normalized into a one-code-point equivalent.

masakielastic's answer comes close but only handles modifiers (the rainbow flag contains a modifier which is thus not counted as its own code point):

fmt.Println(GraphemeCountInString("🏳️‍🌈🇩🇪"))  // Outputs "5".
fmt.Println(GraphemeCountInString2("🏳️‍🌈🇩🇪")) // Outputs "5".

The correct way to split Unicode strings into (user-perceived) characters, i.e. grapheme clusters, is defined in the Unicode Standard Annex #29. The rules can be found in Section 3.1.1. The github.com/rivo/uniseg package implements these rules so you can determine the correct number of characters in a string:

fmt.Println(uniseg.GraphemeClusterCount("🏳️‍🌈🇩🇪")) // Outputs "2".

Answer 4

Score: 6

If you need to take grapheme clusters into account, use the regexp or unicode package. Counting the number of code points (runes) or bytes is also needed for validation, since the length of a grapheme cluster is unbounded. If you want to reject extremely long sequences, check whether they conform to the stream-safe text format.

package main

import (
	"regexp"
	"strings"
	"unicode"
)

func main() {
	str := "\u0308" + "a\u0308" + "o\u0308" + "u\u0308"
	str2 := "a" + strings.Repeat("\u0308", 1000)

	println(4 == GraphemeCountInString(str))
	println(4 == GraphemeCountInString2(str))

	println(1 == GraphemeCountInString(str2))
	println(1 == GraphemeCountInString2(str2))

	println(true == IsStreamSafeString(str))
	println(false == IsStreamSafeString(str2))
}

// GraphemeCountInString counts matches of "a non-mark rune followed by
// any number of marks", falling back to any single character.
func GraphemeCountInString(str string) int {
	re := regexp.MustCompile(`\PM\pM*|.`)
	return len(re.FindAllString(str, -1))
}

// GraphemeCountInString2 approximates the same rule in a single pass
// over the runes.
func GraphemeCountInString2(str string) int {
	length := 0
	checked := false // set once the first non-mark rune has been seen

	for _, c := range str {
		if !unicode.Is(unicode.M, c) {
			length++
			checked = true
		} else if !checked {
			length++ // a leading mark counts on its own
		}
	}

	return length
}

// IsStreamSafeString reports false when a non-mark rune is followed by
// 30 or more marks.
func IsStreamSafeString(str string) bool {
	re := regexp.MustCompile(`\PM\pM{30,}`)
	return !re.MatchString(str)
}

Answer 5

Score: 6

There are several ways to get a string length:

package main

import (
	"bytes"
	"fmt"
	"strings"
	"unicode/utf8"
)

func main() {
	b := "这是个测试"
	len1 := len([]rune(b))
	len2 := bytes.Count([]byte(b), nil) - 1
	len3 := strings.Count(b, "") - 1
	len4 := utf8.RuneCountInString(b)
	fmt.Println(len1)
	fmt.Println(len2)
	fmt.Println(len3)
	fmt.Println(len4)
}

Answer 6

Score: 5

It depends a lot on your definition of a "character". If "one rune equals one character" is acceptable for your task (generally it isn't), then VonC's answer is perfect for you. Otherwise, it should probably be noted that there are few situations where the number of runes in a Unicode string is an interesting value. Even in those situations it is better, if possible, to infer the count while traversing the string as the runes are processed, to avoid doubling the UTF-8 decoding effort.
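
A sketch of that single-pass idea, assuming a task that needs both the rune count and some per-rune work (here, counting letters; the function name scan is made up for illustration):

```go
package main

import (
	"fmt"
	"unicode"
)

// scan decodes s exactly once, counting runes while doing the
// per-rune work in the same loop.
func scan(s string) (runes, letters int) {
	for _, r := range s {
		runes++
		if unicode.IsLetter(r) {
			letters++
		}
	}
	return runes, letters
}

func main() {
	runes, letters := scan("Hello, 世界!")
	fmt.Println(runes, letters)
}
```

Calling utf8.RuneCountInString first and then ranging over the string would decode the UTF-8 twice; folding the count into the loop avoids that.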

Answer 7

Score: 0

I tried to make the normalization a bit faster:

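    en, _ = glyphSmart(data)

    // glyphSmart ranges over the string once and counts its runes
    // (code points), not grapheme clusters.
    func glyphSmart(text string) (int, int) {
        gc := 0
        dummy := 0
        for ind := range text {
            gc++
            dummy = ind
        }
        dummy = 0
        return gc, dummy
    }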

huangapple
  • Published on October 1, 2012 at 14:52:15
  • When reposting, please keep the link to this article: https://go.coder-hub.com/12668681.html