Does accessing elements of string as byte perform conversion?

Question

In Go, to access elements of a string, we can write:

str := "text"
for i, c := range str {
  // str[i] is of type byte
  // c is of type rune
}

When accessing str[i], does Go perform a conversion from rune to byte? I would guess the answer is yes, but I am not sure. If so, which of the following methods is better performance-wise? Is one preferred over the other (in terms of best practice, for example)?

str := "large text"
for i := range str {
  // use str[i]
}

or

str := "large text"
str2 := []byte(str)
for _, s := range str2 {
  // use s
}

Answer 1

Score: 3

string values in Go store the UTF-8 encoded bytes of the text, not its characters or runes.

Indexing a string indexes its bytes: str[i] is of type byte (or uint8; byte is an alias for it). No rune-to-byte conversion takes place. A string is in effect a read-only slice of bytes (with some syntactic sugar), so indexing it does not require converting it to a slice.
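As a minimal sketch of the point above, the following program compares the byte length of a string with its rune count and reads a single byte by index. The helper name byteAt is hypothetical and exists only for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAt (hypothetical helper) returns the byte stored at index i of s.
// Indexing reads straight from the string's UTF-8 bytes.
func byteAt(s string, i int) byte {
	return s[i]
}

func main() {
	s := "Hi 世界"
	fmt.Println(len(s))                    // 9: length in bytes
	fmt.Println(utf8.RuneCountInString(s)) // 5: length in runes
	// Index 3 is the first byte of 世, which is encoded as e4 b8 96.
	fmt.Printf("%T %#x\n", s[3], byteAt(s, 3)) // uint8 0xe4
}
```

The byte count and the rune count differ as soon as the text contains non-ASCII characters, which is why a byte index and a character position should not be treated as the same thing.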

When you use for ... range on a string, that iterates over the runes of the string, not its bytes!

So if you want to iterate over the runes (characters), you must use a for ... range without a conversion to []byte. Your first form, which reads str[i], will not work with string values containing multi-byte (UTF-8) characters, because str[i] is only the first byte of each character.
The spec allows you to for ... range over a string value: the first iteration value is the byte index of the current character, and the second is the current character as a value of type rune (which is an alias for int32):

> For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
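The last sentence of the quoted rule can be checked with a small sketch. The type step and the function rangeSteps are hypothetical names used only here; the input deliberately contains the byte 0xff, which can never appear in valid UTF-8.

```go
package main

import "fmt"

// step (hypothetical type) records one iteration of for ... range over a string.
type step struct {
	Index int  // byte index at which the code point starts
	Rune  rune // decoded code point, or U+FFFD for an invalid sequence
}

// rangeSteps (hypothetical helper) collects what for ... range yields for s.
func rangeSteps(s string) []step {
	var out []step
	for i, r := range s {
		out = append(out, step{Index: i, Rune: r})
	}
	return out
}

func main() {
	s := "a\xffb" // the middle byte is not valid UTF-8
	for _, st := range rangeSteps(s) {
		fmt.Printf("index %d: %U\n", st.Index, st.Rune)
	}
	// index 0: U+0061
	// index 1: U+FFFD
	// index 2: U+0062
}
```

The invalid byte is reported as the replacement character and the loop then advances by exactly one byte, so the following 'b' is still found at index 2.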

Simple example:

s := "Hi 世界"
for i, c := range s {
	fmt.Printf("Char pos: %d, Char: %c\n", i, c)
}

Output (try it on the Go Playground):

Char pos: 0, Char: H
Char pos: 1, Char: i
Char pos: 2, Char:  
Char pos: 3, Char: 世
Char pos: 6, Char: 界

A must-read blog post on this topic:

The Go Blog: Strings, bytes, runes and characters in Go


Note: If you must iterate over the bytes of a string (and not its characters), using a for ... range over a converted string, like your second example, does not make a copy; it's optimized away. For details, see https://stackoverflow.com/questions/43470284/golang-bytestring-vs-bytestring/43470344#43470344.
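For byte-wise iteration, here is a rough sketch of two equivalent loops: a plain index loop, and a for ... range over []byte(s). The function names bytesIndexed and bytesRanged are made up for this example, and the sketch only shows that both loops visit the same bytes; it does not measure allocations.

```go
package main

import "fmt"

// bytesIndexed (hypothetical helper) collects the bytes of s with an index loop.
func bytesIndexed(s string) []byte {
	out := make([]byte, 0, len(s))
	for i := 0; i < len(s); i++ {
		out = append(out, s[i]) // s[i] is already a byte
	}
	return out
}

// bytesRanged (hypothetical helper) collects the bytes of s by ranging
// directly over the conversion []byte(s).
func bytesRanged(s string) []byte {
	out := make([]byte, 0, len(s))
	for _, b := range []byte(s) {
		out = append(out, b)
	}
	return out
}

func main() {
	s := "Hi 世界"
	fmt.Printf("% x\n", bytesIndexed(s))
	fmt.Printf("% x\n", bytesRanged(s))
}
```

Both calls print the same nine bytes. Which form reads better is a matter of taste; the index loop makes it obvious that no conversion is involved at all.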

Answer 2

Score: 1

> Which of the following methods is better performance-wise?

Definitely not this.

str := "large text"
str2 := []byte(str)
for _, s := range str2 {
  // use s
}

Strings are immutable. []byte is mutable. That means []byte(str) makes a copy. So the above will copy the entire string. I've found being unaware of when strings are copied to be a major source of performance problems for large strings.
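The copy semantics described here can be observed with a short sketch. The function upperFirst is a hypothetical example, not something from the question: it modifies the byte slice obtained from the conversion and leaves the original string untouched.

```go
package main

import "fmt"

// upperFirst (hypothetical helper) upper-cases a leading ASCII letter.
// []byte(s) yields a separate, mutable copy, so s itself never changes.
func upperFirst(s string) string {
	b := []byte(s) // copies the bytes of s
	if len(b) > 0 && 'a' <= b[0] && b[0] <= 'z' {
		b[0] -= 'a' - 'A'
	}
	return string(b)
}

func main() {
	str := "large text"
	changed := upperFirst(str)
	fmt.Println(str, "|", changed) // large text | Large text
}
```

Because the slice is written to, the conversion here has to produce a real copy of the string's bytes.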

If str2 is never altered, the compiler may optimize away the copy. For this reason, it's better to range over the conversion directly, like so, which ensures the byte slice is never altered.

str := "large text"
for _, s := range []byte(str) {
  // use s
}

That way there's no str2 to possibly be modified later and ruin the optimization.

But this is a bad idea because it will corrupt any multi-byte characters. See below.


As for the byte/rune conversion, performance is not a consideration as they are not equivalent. c will be a rune, and str[i] will be a byte. If your string contains multi-byte characters, you have to use runes.

For example...

package main

import "fmt"

func main() {
    str := "snow ☃ man"
    for i, c := range str {
        fmt.Printf("c:%c str[i]:%c\n", c, str[i])
    }
}

$ go run ~/tmp/test.go
c:s str[i]:s
c:n str[i]:n
c:o str[i]:o
c:w str[i]:w
c:  str[i]: 
c:☃ str[i]:â
c:  str[i]: 
c:m str[i]:m
c:a str[i]:a
c:n str[i]:n

Note that using str[i] corrupts the multi-byte Unicode snowman, because str[i] holds only the first byte of the multi-byte character.

There's no performance difference anyway, since range str already has to do the work of going character by character rather than byte by byte.
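If the whole character starting at a given byte index is needed rather than its first byte, one option is utf8.DecodeRuneInString from the standard library. The sketch below assumes the index points at the start of a character; runeAt is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeAt (hypothetical helper) decodes the UTF-8 sequence that starts at
// byte index i of s and returns the rune and its width in bytes.
func runeAt(s string, i int) (rune, int) {
	return utf8.DecodeRuneInString(s[i:])
}

func main() {
	str := "snow ☃ man"
	r, size := runeAt(str, 5) // byte index 5 is where the snowman starts
	fmt.Printf("byte: %#x, rune: %c, width: %d\n", str[5], r, size)
	// byte: 0xe2, rune: ☃, width: 3
}
```

Inside a for i, c := range str loop this is unnecessary, since c already holds the decoded rune.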

huangapple
  • Posted on June 12, 2017 at 03:31:21
  • When reposting, please keep the link to this article: https://go.coder-hub.com/44487910.html