Is a non-English string still 'a read-only slice of bytes'?

Question

The Go blog post at https://go.dev/blog/strings says:

> In Go, a string is in effect a read-only slice of bytes.

As I understand it, the byte type in Go is equivalent to uint8 and represents ASCII characters, which works perfectly for strings that consist of English letters only.

For non-English strings, such as Japanese, Korean, Chinese, or Arabic, is it still correct to say "In Go, a string is in effect a read-only slice of bytes"?

Or should I say "In Go, a non-English string is in effect a read-only slice of runes", since ASCII cannot represent Japanese, Korean, Chinese, or Arabic characters, which must be represented in Unicode (UTF-8) using runes?

Answer 1

Score: 4

> As I understand it, the byte type in Go is equivalent to uint8 and represents ASCII characters, which works perfectly for strings that consist of English letters only.

No. Byte doesn't mean ASCII: a byte is just an 8-bit value (byte is an alias for uint8), and Go's string handling doesn't assume ASCII.

Strings in Go are normally UTF-8. The string functions in the standard library work with UTF-8, and iterating over a string with range decodes it as UTF-8 and yields runes (Unicode code points). UTF-8 is an encoding of Unicode into bytes, so a UTF-8 string is still a sequence of bytes. All of this is true regardless of which human language the text is written in.
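As a rough illustration of this point, the sketch below prints the bytes behind a few strings. The helper name `utf8Bytes` and the sample strings are made up for this example; it relies only on the built-in `[]byte` conversion and `fmt`.

```go
package main

import "fmt"

// utf8Bytes (a hypothetical helper) returns a copy of the bytes that make up s.
func utf8Bytes(s string) []byte {
	return []byte(s)
}

func main() {
	// An English, a Japanese, and a Korean string: all are just bytes.
	for _, s := range []string{"go", "日本語", "한국어"} {
		// len(s) is the byte count; % x prints each byte in hex.
		fmt.Printf("%q: %d bytes: % x\n", s, len(s), utf8Bytes(s))
	}
}
```

The English string uses one byte per letter, while each Japanese or Korean character here takes three bytes. In every case the string is the same kind of value: a read-only sequence of bytes.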

Strings can also contain data that isn't UTF-8. As the article you quoted says, a string is basically just an immutable []byte and can contain any sequence of bytes, including binary data and character data in encodings other than UTF-8. This is perfectly valid; it just doesn't make sense to use the strings functions or range on such "strings". The string and []byte types really only capture the difference between immutable and mutable; they don't capture the difference between "a character string" and "a bunch of bytes".

Answer 2

Score: 3

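Yes, a string is a slice of bytes regardless of the character set. For example:

```go
s := "селёдка"
fmt.Printf("%d\n", len(s)) // len reports the number of bytes
```

prints 14 even though the word is 7 letters long, because each of these Cyrillic letters takes two bytes in UTF-8. That means you cannot, for example, use s[2] to get the third character; it returns the third byte.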
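To make the byte indexing concrete, here is a small sketch using the same word. The helper `byteAt` is hypothetical and exists only so the behaviour of `s[i]` has a name.

```go
package main

import "fmt"

// byteAt (a hypothetical helper) returns the i-th byte of s; this is exactly
// what the expression s[i] evaluates to.
func byteAt(s string, i int) byte {
	return s[i]
}

func main() {
	s := "селёдка"

	// Indexing yields a single byte (uint8), here the first byte of "е",
	// the second letter.
	fmt.Printf("%T %#x\n", byteAt(s, 2), byteAt(s, 2))

	// Slicing by byte offsets works when the offsets fall on character
	// boundaries: bytes 4 and 5 together encode the third letter.
	fmt.Println(s[4:6])
}
```

Slicing by hand like this only works if you already know where each character starts, which is why decoding into runes is usually the safer route.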

However, when you iterate over a string with range, you get runes:

```go
s := "селёдка"
for _, c := range s {
    fmt.Printf("%c\n", c) // c is a rune; %c prints it as a character
}
```

prints the word letter by letter.
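The index variable of such a loop is a byte offset, not a character count. The following sketch shows this; `runeOffsets` is a made-up helper name.

```go
package main

import "fmt"

// runeOffsets (a hypothetical helper) returns the byte offset at which each
// rune of s starts, as reported by the index variable of a range loop.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "селёдка"
	for i, c := range s {
		// i advances by the encoded width of each rune (two bytes here).
		fmt.Printf("offset %2d: %c (%U)\n", i, c, c)
	}

	// Mixed widths: one byte, two bytes, three bytes.
	fmt.Println(runeOffsets("aё日"))
}
```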

If you want to work with the runes directly, convert the string to a slice of runes:

```go
r := []rune(s) // r[2] is now the third character, 'л'
```

huangapple
  • Published on September 26, 2021 at 23:52:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/69336534.html