Does accessing elements of string as byte perform conversion?
Question
In Go, to access the elements of a string, we can write:

    str := "text"
    for i, c := range str {
        // str[i] is of type byte
        // c is of type rune
    }

When accessing str[i], does Go perform a conversion from rune to byte? I would guess the answer is yes, but I am not sure. If so, which one of the following methods is better performance-wise? Is one preferred over the other (in terms of best practice, for example)?

    str := "large text"
    for i := range str {
        // use str[i]
    }

or

    str := "large text"
    str2 := []byte(str)
    for _, s := range str2 {
        // use s
    }
Answer 1
Score: 3
string values in Go store the UTF-8 encoded bytes of the text, not its characters or runes.

Indexing a string indexes its bytes: str[i] is of type byte (byte is an alias for uint8). A string is in effect a read-only slice of bytes (with some syntactic sugar), so indexing it requires no conversion, neither of the string to a slice nor of a rune to a byte.
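To make the byte-indexing point concrete, here is a minimal sketch. The helper name byteAt is hypothetical and exists only for illustration; it shows that indexing returns the stored UTF-8 byte and that len counts bytes, not characters.

```go
package main

import "fmt"

// byteAt is a hypothetical helper: it returns the i-th byte of s.
// Indexing a string reads the stored UTF-8 byte directly; no rune is decoded.
func byteAt(s string, i int) byte {
	return s[i]
}

func main() {
	s := "Hi 世界"

	// len reports bytes, not characters:
	// 3 ASCII bytes plus two 3-byte characters.
	fmt.Println(len(s)) // 9

	// "世" is encoded as the bytes E4 B8 96; index 3 is its first byte.
	fmt.Printf("%T %#x\n", byteAt(s, 3), byteAt(s, 3)) // uint8 0xe4
}
```

%T prints uint8 rather than byte because byte is only an alias for that type.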
When you use for ... range on a string, it iterates over the runes of the string, not its bytes!

So if you want to iterate over the runes (characters), you must use for ... range on the string itself, without a conversion to []byte: iterating over the bytes will not work for string values that contain multi-byte (UTF-8) characters.

The spec allows you to for ... range over a string value. The first iteration value is the byte index of the current character, and the second is the character itself, of type rune (an alias for int32):
> For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
Simple example:

    s := "Hi 世界"
    for i, c := range s {
        fmt.Printf("Char pos: %d, Char: %c\n", i, c)
    }

Output:

    Char pos: 0, Char: H
    Char pos: 1, Char: i
    Char pos: 2, Char: 
    Char pos: 3, Char: 世
    Char pos: 6, Char: 界
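The last sentence of the spec quote (invalid UTF-8 yields 0xFFFD and the loop advances a single byte) can be checked with a small sketch. The helper decodeAll and the test string are made up for illustration.

```go
package main

import "fmt"

// decodeAll is a hypothetical helper: it ranges over s and collects
// the rune produced at each iteration.
func decodeAll(s string) []rune {
	var out []rune
	for _, c := range s {
		out = append(out, c)
	}
	return out
}

func main() {
	// The byte 0xFF can never appear in valid UTF-8.
	s := "a\xffb"
	for i, c := range s {
		fmt.Printf("index %d: %U\n", i, c)
	}
	// index 0: U+0061
	// index 1: U+FFFD
	// index 2: U+0062
}
```

The invalid byte is reported as the replacement character, and the following iteration resumes exactly one byte later, at index 2.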
A must-read blog post on this topic:
The Go Blog: Strings, bytes, runes and characters in Go
Note: If you must iterate over the bytes of a string (and not its characters), using a for ... range over the converted string, as in for _, b := range []byte(str), does not make a copy; the compiler optimizes the conversion away. For details, see https://stackoverflow.com/questions/43470284/golang-bytestring-vs-bytestring/43470344#43470344.
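As a rough sketch of byte-wise iteration, the two loops below visit the same bytes: one indexes the string directly, the other ranges over the conversion. The function names are hypothetical, and whether the conversion is actually elided depends on the compiler, so treat the second form as "usually free" rather than guaranteed.

```go
package main

import "fmt"

// sumBytesIndex visits every byte with a plain index loop; no conversion is involved.
func sumBytesIndex(s string) int {
	total := 0
	for i := 0; i < len(s); i++ {
		total += int(s[i])
	}
	return total
}

// sumBytesRange ranges directly over the converted value.
func sumBytesRange(s string) int {
	total := 0
	for _, b := range []byte(s) {
		total += int(b)
	}
	return total
}

func main() {
	s := "Hi 世界"
	fmt.Println(sumBytesIndex(s) == sumBytesRange(s)) // true
}
```

The index loop is the form that clearly needs no conversion at all, which makes it a safe default when bytes are what you want.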
Answer 2
Score: 1
> Which one of the following methods is better performance-wise?

Definitely not this.

    str := "large text"
    str2 := []byte(str)
    for _, s := range str2 {
        // use s
    }

Strings are immutable; []byte is mutable. That means []byte(str) makes a copy, so the code above copies the entire string. I've found that being unaware of when strings are copied is a major source of performance problems with large strings.
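The copy semantics can be observed directly with a short sketch. The helper capitalizeCopy is hypothetical; it mutates the converted slice and leaves the original string untouched.

```go
package main

import "fmt"

// capitalizeCopy is a hypothetical helper: it converts s to a byte slice,
// overwrites the first byte, and returns the result as a new string.
// It assumes s is non-empty.
func capitalizeCopy(s string) string {
	b := []byte(s) // the slice has its own storage; s is not aliased
	b[0] = 'L'
	return string(b)
}

func main() {
	str := "large text"
	changed := capitalizeCopy(str)

	fmt.Println(str)     // large text
	fmt.Println(changed) // Large text
}
```

Because the slice can be written to, it cannot share memory with the immutable string, which is why the conversion has to copy unless the compiler can prove the slice is never modified.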
If str2 is never altered, the compiler may optimize away the copy. For this reason, it's better to write the loop as follows, so that the byte slice cannot be altered:

    str := "large text"
    for _, s := range []byte(str) {
        // use s
    }

That way there's no str2 that could be modified later and ruin the optimization.

But this is a bad idea because it will corrupt any multi-byte characters. See below.

As for the byte/rune conversion, performance is not a consideration because the two are not equivalent: c will be a rune, and str[i] will be a byte. If your string contains multi-byte characters, you have to use runes.
For example...

    package main

    import "fmt"

    func main() {
        str := "snow ☃ man"
        for i, c := range str {
            fmt.Printf("c:%c str[i]:%c\n", c, str[i])
        }
    }

    $ go run ~/tmp/test.go
    c:s str[i]:s
    c:n str[i]:n
    c:o str[i]:o
    c:w str[i]:w
    c:  str[i]: 
    c:☃ str[i]:â
    c:  str[i]: 
    c:m str[i]:m
    c:a str[i]:a
    c:n str[i]:n
Note that using str[i] corrupts the multi-byte Unicode snowman: str[i] contains only the first byte of the multi-byte character.

There's no performance difference anyway, as range str must already do the work of going character by character rather than byte by byte.