What is a rune?


Question

What is a rune in Go?

I've been googling but Golang only says in one line: rune is an alias for int32.

But how come integers are used all around like swapping cases?

The following is a function swapcase.
What is all the <= and -?

And why doesn't switch have any arguments?

&& should mean "and", but what is r <= 'z'?

func SwapRune(r rune) rune {
	switch {
	case 'a' <= r && r <= 'z':
		return r - 'a' + 'A'
	case 'A' <= r && r <= 'Z':
		return r - 'A' + 'a'
	default:
		return r
	}
}

Most of them are from http://play.golang.org/p/H6wjLZj6lW

func SwapCase(str string) string {
    return strings.Map(SwapRune, str)
}

I understand this is mapping rune to string so that it can return the swapped string. But I do not understand how exactly rune or byte works here.

Answer 1

Score: 230

Rune literals are just 32-bit integer values (however, they're untyped constants, so their type can change). They represent Unicode code points. For example, the rune literal 'a' is actually the number 97.

Therefore your program is pretty much equivalent to:

package main

import "fmt"

func SwapRune(r rune) rune {
	switch {
	case 97 <= r && r <= 122:
		return r - 32
	case 65 <= r && r <= 90:
		return r + 32
	default:
		return r
	}
}

func main() {
	fmt.Println(SwapRune('a'))
}

This should be obvious if you look at the Unicode mapping, which is identical to ASCII in that range. Furthermore, 32 is the offset between the uppercase and lowercase code points of each letter. So adding 32 to 'A' gives 'a', and subtracting 32 from 'a' gives 'A'.
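
To make the arithmetic concrete, here is a minimal sketch that prints the numbers behind a few rune literals and shows an untyped rune constant taking on other numeric types. The constant name caseOffset is made up for illustration and is not part of the original program.

```go
package main

import "fmt"

// caseOffset (hypothetical name) is the distance between the lowercase
// and uppercase ASCII letters.
const caseOffset = 'a' - 'A'

func main() {
	fmt.Println('a', 'z', 'A', 'Z') // 97 122 65 90
	fmt.Println(caseOffset)         // 32

	// A rune literal is an untyped constant, so it can initialize other numeric types.
	var b byte = 'a'
	var f float64 = 'a'
	fmt.Println(b, f) // 97 97

	// Without an explicit type, a rune literal defaults to rune (int32).
	r := 'a'
	fmt.Printf("%T %c\n", r, r-caseOffset) // int32 A
}
```

Note that the offset of 32 only holds for the ASCII letters, which is why SwapRune checks the range before adding or subtracting.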

Answer 2

Score: 90

From the Go release notes: http://golang.org/doc/go1#rune

rune is a type. It occupies 32 bits and is meant to represent a Unicode code point.
As an analogy, the English character set encoded in ASCII has 128 code points, and thus fits inside a byte (8 bits). From this (erroneous) assumption, C treated characters as bytes (char) and strings as a sequence of characters (char*).

But guess what? Humans have invented many other symbols besides 'abcde...', and there are so many that we need 32 bits to encode them.

In Go, then, a string is a sequence of bytes. However, since multiple bytes can represent a single rune (code point), a string value can also contain runes. So it can be converted to a []rune, or vice versa.

The unicode package http://golang.org/pkg/unicode/ can give a taste of the richness of the challenge.
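
As a rough illustration of "a sequence of bytes that can contain runes", the sketch below ranges over a string and compares its byte length with its rune count. The helper countBytesAndRunes is a made-up name; utf8.RuneCountInString comes from the standard unicode/utf8 package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countBytesAndRunes (hypothetical helper) returns the byte length and the
// number of code points in s.
func countBytesAndRunes(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "na\u00efve" // "naïve": the ï needs two bytes in UTF-8

	// Ranging over a string decodes UTF-8: i is a byte offset, r is a rune.
	for i, r := range s {
		fmt.Printf("offset %d: %c (%U)\n", i, r, r)
	}

	nBytes, nRunes := countBytesAndRunes(s)
	fmt.Println(nBytes, nRunes) // 6 5
}
```

The byte offsets printed by the loop skip from 2 to 4, because the third character occupies two bytes.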

Answer 3

Score: 67

I have tried to keep my language simple so that a layman understands rune.

A rune is a character. That's it.

It is a single character. It's a character from any alphabet from any language from anywhere in the world.

To get a string we use

double-quotes ""

OR

back-ticks ``

A string is different from a rune. For runes we use

single-quotes ''

Now, a rune is also an alias for int32... Uh, what?

The reason rune is an alias for int32 is that, in character encoding schemes such as the one below,

(image of a character encoding table omitted)

each character maps to some number, and so it's the number that we are storing. For example, a maps to 97, and when we store that number it's just the number; that's why rune is an alias for int32. But it is not just any number. It is a number with 32 'zeros and ones', or 4 bytes. (Note: UTF-8 is a variable-width encoding that uses 1 to 4 bytes per character.)

How do runes relate to strings?

A string is a collection of runes, stored as UTF-8 encoded bytes. In the following code:

package main

import (
	"fmt"
)

func main() {
	fmt.Println([]byte("Hello"))
}

We convert the string to a slice of bytes. The output is:

[72 101 108 108 111]

We can see that each of the bytes that make up this string corresponds to one rune, because every character in "Hello" is ASCII and fits in a single byte.
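
A small sketch of two points from this answer: the quoting style decides whether you get a rune or a string, and the number of bytes a rune needs in UTF-8 varies. The helper encodedWidths is a hypothetical name; utf8.RuneLen is from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedWidths (hypothetical helper) reports how many bytes each rune
// needs when encoded as UTF-8.
func encodedWidths(rs ...rune) []int {
	widths := make([]int, len(rs))
	for i, r := range rs {
		widths[i] = utf8.RuneLen(r)
	}
	return widths
}

func main() {
	// Single quotes make a rune; double quotes or back-ticks make a string.
	fmt.Printf("%T %T %T\n", 'a', "a", `a`) // int32 string string

	// 'a', U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (an emoji).
	fmt.Println(encodedWidths('a', '\u00e9', '\u20ac', '\U0001F600')) // [1 2 3 4]
}
```

So a rune always occupies 4 bytes in memory as an int32, but the same character takes between 1 and 4 bytes once it is stored inside a string.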

Answer 4

Score: 39

(I had a feeling that the answers above still didn't state the differences and relationships between string and []rune very clearly, so I'll try to add another answer with an example.)

As @Strangework's answer said, string and []rune are quite different.

Differences - string & []rune:

  • A string value is a read-only slice of bytes, and a string literal is encoded in UTF-8. Each character in a string takes 1 to 4 bytes, while each rune takes 4 bytes.
  • For string, both len() and indexing are based on bytes.
  • For []rune, both len() and indexing are based on runes (that is, int32 values).

Relationships - string & []rune:

  • When you convert from string to []rune, each UTF-8 character in the string becomes a rune.
  • Similarly, in the reverse conversion from []rune to string, each rune becomes a UTF-8 character in the string.

Tips:

  • You can convert between string and []rune, but they are still different, in both type and overall size.

(I'll add an example to show this more clearly.)


Code

string_rune_compare.go:

// Compare string and []rune.
package main

import "fmt"

// stringAndRuneCompare prints the type, length, and elements of a string and of its []rune conversion.
func stringAndRuneCompare() {
	// string
	s := "hello你好"

	fmt.Printf("%s, type: %T, len: %d\n", s, s, len(s))
	fmt.Printf("s[%d]: %v, type: %T\n", 0, s[0], s[0])
	li := len(s) - 1 // last index
	fmt.Printf("s[%d]: %v, type: %T\n\n", li, s[li], s[li])

	// []rune
	rs := []rune(s)
	fmt.Printf("%v, type: %T, len: %d\n", rs, rs, len(rs))
}

func main() {
	stringAndRuneCompare()
}

Execute:

> go run string_rune_compare.go

Output:

hello你好, type: string, len: 11
s[0]: 104, type: uint8
s[10]: 189, type: uint8

[104 101 108 108 111 20320 22909], type: []int32, len: 7

Explanation:

  • The string hello你好 has length 11, because the first 5 characters take 1 byte each, while the last 2 Chinese characters take 3 bytes each.

    • Thus, total bytes = 5 * 1 + 2 * 3 = 11.
    • Since len() on a string counts bytes, the first line prints len: 11.
    • Since indexing a string is also byte-based, the following 2 lines print values of type uint8 (byte is an alias for uint8 in Go).
  • When converting the string to []rune, Go finds 7 UTF-8 characters, thus 7 runes.

    • Since len() on a []rune counts runes, the last line prints len: 7.
    • If you index into a []rune, you access it rune by rune.
      Since each rune comes from one UTF-8 character in the original string, you can also say that both len() and indexing on []rune are based on UTF-8 characters.
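
Building on the explanation above, this sketch contrasts byte indexing with rune indexing on the same string and replaces one character through a []rune round trip. The helper replaceRuneAt is a hypothetical name introduced only for this example.

```go
package main

import "fmt"

// replaceRuneAt (hypothetical helper) returns s with the rune at rune
// index i replaced by r. It indexes by rune, not by byte, by going
// through a []rune conversion.
func replaceRuneAt(s string, i int, r rune) string {
	rs := []rune(s)
	rs[i] = r
	return string(rs)
}

func main() {
	s := "hello你好"

	// Indexing the string yields only the first byte of the 3-byte encoding of '你'.
	fmt.Println(s[5]) // 228

	// Indexing the []rune yields the whole code point.
	fmt.Println(string([]rune(s)[5])) // 你

	// The round trip string -> []rune -> string preserves valid UTF-8 text.
	fmt.Println(string([]rune(s)) == s) // true

	fmt.Println(replaceRuneAt(s, 5, '您')) // hello您好
}
```

The conversion copies the data, so for long strings it may be cheaper to iterate with for range than to build a []rune just to read one character.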

Answer 5

Score: 38

I do not have enough reputation to post a comment to fabrizioM's answer, so I will have to post it here instead.

Fabrizio's answer is largely correct, and he certainly captured the essence of the problem - though there is a distinction which must be made.

A string is NOT necessarily a sequence of runes. It is a wrapper over a 'slice of bytes', a slice being a wrapper over a Go array. What difference does this make?

A rune is necessarily a 32-bit value, meaning a sequence of rune values always occupies a multiple of 32 bits. Strings, being sequences of bytes, instead occupy a multiple of 8 bits. If all strings were actually in Unicode, this difference would have no impact. Since strings are slices of bytes, however, a Go string can hold ASCII or any other arbitrary byte encoding.

String literals, however, are required to be written into the source encoded in UTF-8.

Source of information: http://blog.golang.org/strings
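
To illustrate that a string can hold arbitrary bytes, the sketch below builds one from byte escapes and checks it with utf8.ValidString. The helper name describe is an assumption made for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe (hypothetical helper) reports whether s is valid UTF-8 and
// what the conversion []rune(s) produces.
func describe(s string) (bool, []rune) {
	return utf8.ValidString(s), []rune(s)
}

func main() {
	// Escapes like \xff put raw bytes into a string value; the value
	// does not have to be valid UTF-8.
	raw := "\xff\xfe"
	ok, rs := describe(raw)
	fmt.Println(len(raw), ok, rs) // 2 false [65533 65533]

	ok, rs = describe("hi")
	fmt.Println(ok, rs) // true [104 105]
}
```

Each invalid byte becomes the replacement character U+FFFD (65533) when converted to a rune, so the []rune round trip is lossy for strings that are not valid UTF-8.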

Answer 6

Score: 9

Everyone else has covered the part related to runes, so I am not going to talk about that.

However, there is also a question about switch not having any arguments. This is simply because in Go, a switch without an expression is an alternate way to express if/else logic. For example, writing this:

t := time.Now()
switch {
case t.Hour() < 12:
    fmt.Println("It's before noon")
default:
    fmt.Println("It's after noon")
}

is the same as writing this:

t := time.Now()
if t.Hour() < 12 {
    fmt.Println("It's before noon")
} else {
    fmt.Println("It's after noon")
}

You can read more here.
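
Tying this back to the question, here is a sketch of the same expression-less switch applied to runes. It is a variant written for illustration, not the original SwapRune: the name swapRune and the choice to use the standard unicode package instead of ASCII range checks are assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// swapRune (hypothetical variant of the question's SwapRune) swaps the
// case of any letter that has one, not only ASCII letters.
func swapRune(r rune) rune {
	switch { // same as "switch true": the first case that evaluates to true wins
	case unicode.IsLower(r):
		return unicode.ToUpper(r)
	case unicode.IsUpper(r):
		return unicode.ToLower(r)
	default:
		return r
	}
}

func main() {
	fmt.Println(strings.Map(swapRune, "Hello, World 42")) // hELLO, wORLD 42
}
```

Each case is an ordinary boolean expression, which is why `'a' <= r && r <= 'z'` works as a case in the original function.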

Answer 7

Score: 3

A rune is an int32 value, and it is the Go type used for representing a Unicode code point. A Unicode code point, or code position, is a numerical value that usually represents a single Unicode character.

Answer 8

Score: 2

Program

package main

import (
	"fmt"
)

func main() {
	words := "€25 or less"
	fmt.Println("as string slice")
	fmt.Println(words, len(words))

	runes := []rune(words)
	fmt.Println("\nas []rune slice")
	fmt.Printf("%v, len:%d\n", runes, len(runes))

	bytes := []byte(words)
	fmt.Println("\nas []byte slice")
	fmt.Printf("%v, len:%d\n", bytes, len(bytes))
}

Output

as string slice
€25 or less 13

as []rune slice
[8364 50 53 32 111 114 32 108 101 115 115], len:11

as []byte slice
[226 130 172 50 53 32 111 114 32 108 101 115 115], len:13

As you can see, the euro symbol '€' is represented by 3 bytes: 226, 130 and 172.
A rune represents a character, any character, even a hieroglyph. The 32 bits of a rune are sufficient to represent all the characters in the world as of today. Hence, the rune representation of the euro symbol '€' is 8364.

For the 128 ASCII characters, a byte (8 bits) is sufficient. Hence, the rune and byte representations of digits and letters are the same. E.g., 2 is represented by 50.

The byte representation of a string is always greater than or equal to its rune representation in length, since certain characters need more than one byte, but never more than the 32 bits of a rune.

https://play.golang.org/p/y93woDLs4Qe
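
The byte and rune values in the output can also be reproduced one character at a time. This sketch uses utf8.EncodeRune and utf8.DecodeRuneInString from the standard library; the helper name encode is made up for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode (hypothetical helper) returns the UTF-8 bytes for a single rune.
func encode(r rune) []byte {
	buf := make([]byte, utf8.UTFMax) // UTFMax is 4, the longest possible encoding
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

func main() {
	fmt.Println(encode('€')) // [226 130 172]
	fmt.Println(encode('2')) // [50]

	// Going the other way: decode the first rune of a string and get its width.
	r, size := utf8.DecodeRuneInString("€25 or less")
	fmt.Println(r, size) // 8364 3
}
```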

Answer 9

Score: 1

rune is an alias for int32 and is equivalent to int32 in all ways. It is
used to distinguish character values from integer values.

> l = 108, o = 111
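
A short sketch of what "equivalent in all ways" means in practice: a rune can be passed wherever an int32 is expected without any conversion. The function sum is a hypothetical helper used only for this illustration.

```go
package main

import "fmt"

// sum (hypothetical helper) adds up plain int32 values.
func sum(values ...int32) int32 {
	var total int32
	for _, v := range values {
		total += v
	}
	return total
}

func main() {
	var l rune = 'l'
	var o rune = 'o'

	// No conversion is needed: rune and int32 are the same type.
	var n int32 = l
	fmt.Println(n, o) // 108 111

	fmt.Println(sum(l, o)) // 219

	// The name only signals intent; use %c to print the value as a character.
	fmt.Printf("%d %c\n", l, l) // 108 l
}
```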

Answer 10

Score: 0

Rune is an alias for the int32 type. It represents a single Unicode code point.
The Unicode Consortium assigns numeric values, called code points, to over one million unique characters. For example, 65 is the code point for the letter A, and 66 is the code point for B.
(source: Get Programming with Go)
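
The mapping from code point to character can be checked directly with fmt's %U and %c verbs. This is a minimal sketch; describeCodePoint is a made-up helper name.

```go
package main

import "fmt"

// describeCodePoint (hypothetical helper) formats a code point in U+
// notation together with the character it stands for.
func describeCodePoint(cp rune) string {
	return fmt.Sprintf("%U is %c", cp, cp)
}

func main() {
	fmt.Println(describeCodePoint(65)) // U+0041 is A
	fmt.Println(describeCodePoint(66)) // U+0042 is B

	// Converting a rune to a string yields its UTF-8 encoding.
	fmt.Println(string(rune(65)) == "A") // true
}
```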

huangapple
  • Posted on October 11, 2013 at 13:14:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/19310700.html