How to encode a []rune into a []byte using UTF-8

Question

It's really easy to decode a []byte into a []rune: converting to string and then to []rune works very nicely (I'm assuming it defaults to UTF-8, with filler bytes for invalid input). My question is: how are you supposed to encode this []rune back into a []byte in UTF-8 form?

Am I missing something, or do I have to manually call EncodeRune for every single rune in my []rune? Surely there is an encoder that I can simply pass a Writer to.

Answer 1

Score: 52

You can simply convert a rune slice ([]rune) to a string, which you can then convert to []byte.

Example:

rs := []rune{'H', 'e', 'l', 'l', 'o', ' ', '世', '界'}
bs := []byte(string(rs))

fmt.Printf("%s\n", bs)
fmt.Println(string(bs))

Output (try it on the Go Playground):

Hello 世界
Hello 世界

The Go Specification mentions this case explicitly under "Conversions", in the section "Conversions to and from a string type", point #3:

> Converting a slice of runes to a string type yields a string that is the concatenation of the individual rune values converted to strings.
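The rule quoted above also covers runes that are not valid code points: when converted to a string, each one becomes U+FFFD, which takes 3 bytes in UTF-8. A minimal sketch of that behavior follows; runesToString is a hypothetical helper name used only for illustration, not something from the answer.

```go
package main

import "fmt"

// runesToString is a hypothetical helper that wraps the built-in conversion.
func runesToString(rs []rune) string {
	return string(rs)
}

func main() {
	// 0xD800 is a surrogate half and -1 is out of range;
	// neither is a valid Unicode code point.
	rs := []rune{'a', 0xD800, -1, 'b'}
	s := runesToString(rs)

	fmt.Printf("%q\n", s)       // each invalid rune shows up as U+FFFD
	fmt.Println(len([]byte(s))) // 1 + 3 + 3 + 1 bytes
}
```

This is likely the "filler" the question refers to: the conversion never fails, it substitutes the replacement character.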

Note that the above solution, although it may be the simplest, might not be the most efficient. It first creates a string value that holds a "copy" of the runes in UTF-8 encoded form, then copies the string's bytes into the result byte slice. The second copy has to be made because string values are immutable: if the result slice shared data with the string, we would be able to modify the content of the string. For details, see https://stackoverflow.com/questions/43470284/golang-bytestring-vs-bytestring/43470344#43470344 and https://stackoverflow.com/questions/47352449/immutable-string-and-pointer-address/47352588#47352588.

Note that a smart compiler could detect that the intermediate string value cannot be referred to and thus eliminate one of the copies.
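The copy made by the string-to-[]byte conversion can be observed directly. The sketch below uses a hypothetical helper, copyToBytes, that only wraps the built-in conversion; it shows that writing to the resulting slice leaves the original string untouched.

```go
package main

import "fmt"

// copyToBytes is a hypothetical helper wrapping the built-in conversion.
func copyToBytes(s string) []byte {
	return []byte(s)
}

func main() {
	s := "Hello"
	b := copyToBytes(s)
	b[0] = 'J' // modifies only the copy, never the string

	fmt.Println(s, string(b)) // Hello Jello
}
```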

We may get better performance by allocating a single byte slice and encoding the runes one by one into it. The unicode/utf8 package makes this easy:

rs := []rune{'H', 'e', 'l', 'l', 'o', ' ', '世', '界'}
bs := make([]byte, len(rs)*utf8.UTFMax)

count := 0
for _, r := range rs {
	count += utf8.EncodeRune(bs[count:], r)
}
bs = bs[:count]

fmt.Printf("%s\n", bs)
fmt.Println(string(bs))

Output of the above is the same. Try it on the Go Playground.

Note that in order to create the result slice, we had to guess how big it will be. We used a maximum estimate: the number of runes multiplied by the maximum number of bytes a rune may be encoded to (utf8.UTFMax, which is 4). In most cases, this will be bigger than needed.

We may create a third version where we first calculate the exact size needed, using the utf8.RuneLen() function. The gain is that we do not "waste" memory and do not need the final slicing (bs = bs[:count]). One caveat: utf8.RuneLen() returns -1 for a value that is not a valid code point, while utf8.EncodeRune() writes the 3-byte utf8.RuneError for it, so the size calculation has to account for that case.
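To make the difference between the two sizing strategies concrete, here is a small sketch that computes both numbers for the sample input. The function name sizes is hypothetical, and treating an invalid rune as 3 bytes is an assumption based on utf8.EncodeRune writing utf8.RuneError in that case.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes returns the worst-case and the exact UTF-8 byte counts for rs.
func sizes(rs []rune) (upper, exact int) {
	upper = len(rs) * utf8.UTFMax
	for _, r := range rs {
		n := utf8.RuneLen(r)
		if n < 0 {
			// RuneLen reports -1 for values that are not valid code points;
			// EncodeRune would write utf8.RuneError (3 bytes) for them.
			n = utf8.RuneLen(utf8.RuneError)
		}
		exact += n
	}
	return upper, exact
}

func main() {
	upper, exact := sizes([]rune("Hello 世界"))
	fmt.Println(upper, exact) // 32 12
}
```

For this input the worst-case estimate allocates 32 bytes where 12 are needed: six ASCII characters at 1 byte each plus two CJK characters at 3 bytes each.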

Let's compare the performance of the three versions:

func runesToUTF8(rs []rune) []byte {
	return []byte(string(rs))
}

func runesToUTF8Manual(rs []rune) []byte {
	bs := make([]byte, len(rs)*utf8.UTFMax)

	count := 0
	for _, r := range rs {
		count += utf8.EncodeRune(bs[count:], r)
	}

	return bs[:count]
}

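func runesToUTF8Manual2(rs []rune) []byte {
	size := 0
	for _, r := range rs {
		n := utf8.RuneLen(r)
		if n < 0 {
			// Invalid rune: EncodeRune writes utf8.RuneError instead.
			n = utf8.RuneLen(utf8.RuneError)
		}
		size += n
	}

	bs := make([]byte, size)

	count := 0
	for _, r := range rs {
		count += utf8.EncodeRune(bs[count:], r)
	}

	return bs
}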

And the benchmarking code:

var rs = []rune{'H', 'e', 'l', 'l', 'o', ' ', '世', '界'}

func BenchmarkFirst(b *testing.B) {
	for i := 0; i < b.N; i++ {
		runesToUTF8(rs)
	}
}

func BenchmarkSecond(b *testing.B) {
	for i := 0; i < b.N; i++ {
		runesToUTF8Manual(rs)
	}
}

func BenchmarkThird(b *testing.B) {
	for i := 0; i < b.N; i++ {
		runesToUTF8Manual2(rs)
	}
}

And the results:

BenchmarkFirst-4        20000000                95.8 ns/op
BenchmarkSecond-4       20000000                84.4 ns/op
BenchmarkThird-4        20000000                81.2 ns/op

As suspected, the second version is faster and the third version is the fastest, although the performance gain is not huge. In general, the first, simplest solution is preferred, but if this is in a critical part of your app (and is executed many times), the third version might be worth using.
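On Go 1.18 or later, the unicode/utf8 package also provides utf8.AppendRune, which avoids the manual index bookkeeping. The following is a sketch of a possible fourth variant, not part of the benchmark above; the name runesToUTF8Append and the initial capacity of len(rs) are illustrative choices, and no claim is made here about how it compares in speed.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesToUTF8Append is a hypothetical fourth variant (requires Go 1.18+).
func runesToUTF8Append(rs []rune) []byte {
	// Every rune needs at least one byte, so len(rs) is a lower bound;
	// append grows the slice when multi-byte runes need more room.
	bs := make([]byte, 0, len(rs))
	for _, r := range rs {
		bs = utf8.AppendRune(bs, r)
	}
	return bs
}

func main() {
	fmt.Printf("%s\n", runesToUTF8Append([]rune("Hello 世界")))
}
```

Like utf8.EncodeRune, utf8.AppendRune writes the encoding of utf8.RuneError for a rune that is out of range, so invalid input is handled the same way as in the other versions.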

huangapple
  • Posted on March 25, 2015, 20:31:16
  • When reposting, please keep the link to this article: https://go.coder-hub.com/29255746.html