Questions about runes, strings & Unicode characters in Go

Question

A string in Go is an immutable sequence of bytes. A byte is an alias for uint8. A rune is an alias for int32 and is used to store characters.

Why do runes use int32 instead of uint32? There is no such thing as a negative character.

Strings use bytes, and a single byte is enough to store an ASCII character but not an arbitrary Unicode character. However, Go can store Unicode characters in strings, yet indexing such a character loses data. You can't implicitly convert a float64 to an int in Go, since that might lose data, but indexing a string that contains a Unicode character does not raise any error and just loses data. How can I index a rune out of a string, instead of a byte?

Consider the following program and its output.

package main

import (
	"fmt"
)

func main() {
	x := "ඞ"
	y := x[0] // y is a byte: the first byte of the UTF-8 encoding of x

	z := 'ඞ' // z is a rune holding the whole code point

	fmt.Printf("%s vs %c vs %c\n", x, y, z)
}

ඞ vs à vs ඞ

What I feel a string does to store Unicode characters is combine bytes, since x[1] can be indexed as well.

Answer 1

Score: 1


To take your questions in turn...

Why is rune an int32 rather than a uint32?

I suspect this may have something to do with the native representation of integers at the machine level, which may be optimised for signed rather than unsigned ints.

But ultimately it does not matter.

First of all, Unicode code points (currently at least) only use the range 0x0000 to 0x10FFFF, so you will never encounter a negative rune when dealing with legitimate Unicode.

If there were such a thing as an int24, it would be sufficient. Code points need only 21 bits, so the upper bits of an int32 (where the sign bit resides, obviously) are unused by Unicode.

So it could be that this is the reason for using int32, and that it has nothing to do with "optimisation".

But even if the Unicode specification were to expand to the full 32-bit range, this still would not present a problem.

Whether signed or unsigned, the internal representation would be consistent. So, for example, if some Go code were to exchange runes with some other code that uses an unsigned type, there would be no problem, since fundamentally what is being exchanged are the 32 bits in each rune, not the interpretation overlaid on those 32 bits by any particular type.
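A minimal sketch of that point, assuming the "other code" simply stores code points as uint32. The function roundTrip is a hypothetical name used only here.

```go
package main

import "fmt"

// roundTrip converts a rune to uint32 and back, as might happen when
// exchanging code points with code that uses an unsigned type.
func roundTrip(r rune) rune {
	u := uint32(r) // same 32 bits, read as unsigned
	return rune(u) // same 32 bits, read as signed again
}

func main() {
	r := 'ඞ'
	fmt.Printf("%c %U -> %c\n", r, r, roundTrip(r))

	// Even a negative (invalid) rune keeps its bit pattern.
	neg := rune(-1)
	fmt.Printf("%#x -> %d\n", uint32(neg), roundTrip(neg))
}
```

In Go, converting between int32 and uint32 leaves the bit pattern unchanged, which is why the value survives the round trip.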

The sign might be important if performing arithmetic using runes, though if you were doing that I would expect you to have a deep understanding of runes and how to manipulate them safely (presumably for the purposes of some form of cryptography; I can't think of any other reason for doing rune arithmetic).
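To make the remark about arithmetic concrete, here is a small sketch of where the sign shows up: subtracting a larger rune from a smaller one. Both helper names are invented for this example.

```go
package main

import "fmt"

// diff returns the signed distance between two runes.
func diff(a, b rune) rune {
	return a - b
}

// unsignedDiff performs the same subtraction on uint32 values, which
// wraps around instead of going negative.
func unsignedDiff(a, b rune) uint32 {
	return uint32(a) - uint32(b)
}

func main() {
	fmt.Println(diff('a', 'b'))         // signed result
	fmt.Println(unsignedDiff('a', 'b')) // wrapped unsigned result
}
```

With the signed type the result is -1, while the unsigned version wraps to the largest uint32 value.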

Indexing a Byte in a String "loses data"

No, indexing a byte in a string (which is effectively a read-only sequence of bytes) gives you precisely the data you asked for: the single byte at the specified index.

Nothing is lost (or gained).

If you want a rune represented by a sequence of bytes in a string then you need to ask for all of the bytes that represent that rune.
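One way to ask for all of the bytes of a rune is utf8.DecodeRuneInString from the standard library, which decodes the rune at the start of a string and reports how many bytes it used. The wrapper firstRune below is a hypothetical helper for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune decodes the rune at the start of s and reports how many
// bytes of s it occupies.
func firstRune(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	x := "ඞ"
	fmt.Printf("bytes: % x\n", x) // the individual UTF-8 bytes of x

	r, size := firstRune(x)
	fmt.Printf("rune: %c (%d bytes)\n", r, size)
}
```

To decode a rune at another position, pass a suffix such as x[offset:], where offset is the byte index at which a rune starts.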

Indexing a Rune in a String

First convert the string to a []rune, then index it as you would any other slice. So, given a string s and wishing to obtain the i-th rune:

r := []rune(s)[i]
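The one-liner above can be wrapped in a function, and for long strings a range loop is a possible alternative, since the []rune conversion decodes and copies the whole string. Both function names below are hypothetical and only sketch the idea.

```go
package main

import "fmt"

// runeAt returns the i-th rune of s by converting s to []rune.
// It panics if i is out of range, like any slice index.
func runeAt(s string, i int) rune {
	return []rune(s)[i]
}

// runeAtRange finds the i-th rune by ranging over s, which decodes
// runes one at a time and stops early. The bool reports whether the
// string has an i-th rune.
func runeAtRange(s string, i int) (rune, bool) {
	n := 0
	for _, r := range s {
		if n == i {
			return r, true
		}
		n++
	}
	return 0, false
}

func main() {
	s := "aඞb"
	fmt.Printf("%c\n", runeAt(s, 1))

	if r, ok := runeAtRange(s, 2); ok {
		fmt.Printf("%c\n", r)
	}
}
```

Either way, finding the i-th rune takes time proportional to the length of the string, because UTF-8 is a variable-width encoding.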

huangapple
  • Published on December 16, 2022, 03:11:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/74816440.html