为什么在Go语言中,rune是int32的别名而不是uint32?

huangapple go评论89阅读模式
英文:

Why is rune in golang an alias for int32 and not uint32?

问题

在Go语言中,类型rune定义为:

runeint32的别名,在所有方面等同于int32。按照惯例,它用于区分字符值和整数值。

如果意图是使用这个类型来表示字符值,为什么Go语言的作者没有使用uint32而是使用了int32?当rune值为负数时,他们希望如何处理它在程序中?另一个类似的类型byteuint8的别名(而不是int8),这似乎是合理的。

英文:

The type rune in Go is defined as

> an alias for int32 and is equivalent to int32 in all ways. It is
> used, by convention, to distinguish character values from integer
> values.

If the intention is to use this type to represent character values, why did the authors of the Go language do not use uint32 instead of int32? How do they expect a rune value to be handled in a program, when it is negative? The other similar type, byte, is an alias for uint8 (and not int8), which seems reasonable.

答案1

得分: 29

我在谷歌上搜索并找到了这个

> 这个问题已经被问过很多次了。rune占据4个字节,而不仅仅是一个字节,因为它被设计用来存储Unicode码点,而不仅仅是ASCII字符。和数组索引一样,该数据类型是有符号的,这样你就可以在使用这些类型进行算术运算时轻松检测溢出或其他错误。

英文:

I googled and found this

> This has been asked several times. rune occupies 4 bytes and not just one because it is supposed to store unicode codepoints and not just ASCII characters. Like array indices, the datatype is signed so that you can easily detect overflows or other errors while doing arithmetic with those types.

答案2

得分: 6

它不会变成负数。目前Unicode中有1,114,112个代码点,远远少于2,147,483,647 (0x7fffffff),即使考虑到所有保留的块。

英文:

It doesn’t become negative. There are currently 1,114,112 codepoints in Unicode, which is far from 2,147,483,647 (0x7fffffff) – even considering all the reserved blocks.

答案3

得分: 5

Golang, Go: 顺便问一下,rune是什么?”中提到:

> 根据最新的Unicode 6.3标准,定义了超过110,000个符号。这要求每个码点至少需要21位的表示,所以rune就像int32一样有很多位。

但是关于溢出或负值问题,需要注意一些Unicode函数的实现,比如unicode.IsGraphic,其中包括:

> 我们将其转换为uint32以避免额外的负值测试。

代码:

const MaxLatin1 = '\u00FF' // Latin-1的最大值。

// IsGraphic报告该rune是否被Unicode定义为图形字符。
// 这些字符包括类别L、M、N、P、S、Zs中的字母、标记、数字、标点符号、符号和空格。
func IsGraphic(r rune) bool {
    // 我们将其转换为uint32以避免额外的负值测试,
    // 并且在索引中我们将其转换为uint8以避免范围检查。
    if uint32(r) <= MaxLatin1 {
        return properties[uint8(r)]&pg != 0
    }
    return In(r, GraphicRanges...)
}

这可能是因为rune被认为是常量(如“Go rune类型解释”中所述),其中rune可以是int32uint32甚至是float32等:它的常量值允许它存储在任何这些数值类型中。

英文:

"Golang, Go : what is rune by the way?" mentioned:

> With the recent Unicode 6.3, there are over 110,000 symbols defined. This requires at least 21-bit representation of each code point, so a rune is like int32 and has plenty of bits.

But regarding the overflow or negative value issues, note that the implementation of some of the unicode functions like unicode.IsGraphic do include:

> We convert to uint32 to avoid the extra test for negative

Code:

const MaxLatin1 = &#39;\u00FF&#39; // maximum Latin-1 value.

// IsGraphic reports whether the rune is defined as a Graphic by Unicode.
// Such characters include letters, marks, numbers, punctuation, symbols, and
// spaces, from categories L, M, N, P, S, Zs.
func IsGraphic(r rune) bool {
    // We convert to uint32 to avoid the extra test for negative,
    // and in the index we convert to uint8 to avoid the range check.
    if uint32(r) &lt;= MaxLatin1 {
        return properties[uint8(r)]&amp;pg != 0
    }
    return In(r, GraphicRanges...)
}

That may be because a rune is supposed to be constant (as mentioned in "Go rune type explanation", where a rune could be in an int32 or uint32 or even float32 or ...: its constant value authorizes it to be stored in any of those numeric types).

答案4

得分: 4

它允许负值的事实使您能够定义自己的rune哨兵值

例如:

const EOF rune = -1

func (l *lexer) next() (r rune) {
    if l.pos >= len(l.input) {
        l.width = 0
        return EOF
    }
    r, l.width = utf8.DecodeRuneInString(l.input[l.pos:])
    l.pos += l.width
    return r
}

在这里可以看到Rob Pike的演讲:Go中的词法扫描

英文:

The fact that it's allowed a negative value lets you define your own rune sentinel values.

For example:

const EOF rune = -1

func (l *lexer) next() (r rune) {
	if l.pos &gt;= len(l.input) {
		l.width = 0
		return EOF
	}
	r, l.width = utf8.DecodeRuneInString(l.input[l.pos:])
	l.pos += l.width
	return r
}

Seen here in a talk by Rob Pike: Lexical Scanning in Go.

答案5

得分: 2

除了上面给出的答案之外,我对为什么Go需要rune也有一些看法:

  • Go语言中的字符串是字节数组,每个字符都用一个字节表示。因此,与其他语言相比,Go语言具有非常高的性能优势。
  • 但是,由于我们需要一种表示不能用8位范围表示的UTF-8码点的方法,所以我们使用rune来表示它们。
  • 为什么是int32而不是uint32?这是为了在对字符串进行操作时检测溢出而故意设计的。

这篇文章详细讨论了所有这些内容。

英文:

In addition to the above answers given, here are my two cents to why Go needed rune.

  • Strings in GoLang are byte arrays with each character being represented as a single byte. Thus GoLang has a very high-performance advantage when compared to other languages
  • But since we need a way to represent UTF-8 codepoints which cannot be represented with an 8bit range, we use the rune to represent them.
  • Why int32 and why not uint32? you may ask. This is made deliberately to detect overflows while doing operations on strings.

this article talks all these in much more details

huangapple
  • 本文由 发表于 2014年7月12日 23:55:52
  • 转载请务必保留本文链接:https://go.coder-hub.com/24714665.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定