Overhead of converting from []byte to string and vice-versa

Question

I always seem to be converting strings to []byte and back to string again, over and over. Is there a lot of overhead in this? Is there a better way?

For example, here is a function that accepts a UTF-8 string, normalizes it, removes accents, then converts special characters to their ASCII equivalents:

var transliterations = map[rune]string{'Æ':"AE",'Ð':"D",'Ł':"L",'Ø':"OE",'Þ':"Th",'ß':"ss",'æ':"ae",'ð':"d",'ł':"l",'ø':"oe",'þ':"th",'Œ':"OE",'œ':"oe"}
func RemoveAccents(s string) string {
	b := make([]byte, len(s))
	t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
	_, _, e := t.Transform(b, []byte(s), true)
	if e != nil { panic(e) }
	r := string(b)
	
	var f bytes.Buffer
	for _, c := range r {
		temp := rune(c)
		if val, ok := transliterations[temp]; ok {
			f.WriteString(val)
		} else {
			f.WriteRune(temp)
		}
	}
	return f.String()
}

So I'm starting with a string because that's what I get, then I'm converting it to a byte array, then back to a string, then to a byte array again, then back to a string again. Surely this is unnecessary, but I can't figure out how to avoid it...? And does it really have a lot of overhead, or do I not have to worry about slowing things down with excessive conversions?

(Also, if anyone has the time: I've not yet figured out how bytes.Buffer actually works. Would it not be better to initialize a buffer at 2x the size of the string, since that is the maximum output size of the return value?)

Answer 1

Score: 4

In Go, strings are immutable, so any change creates a new string. As a general rule, convert from a string to a byte or rune slice once, and convert back to a string once. To avoid reallocations for small and transient buffers, over-allocate to provide a safety margin if you don't know the exact size.

For example,

package main

import (
	"bytes"
	"fmt"
	"unicode"
	"unicode/utf8"

	"golang.org/x/text/transform"
	"golang.org/x/text/unicode/norm"
)

var isMn = func(r rune) bool {
	return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
}

var transliterations = map[rune]string{
	'Æ': "AE", 'Ð': "D", 'Ł': "L", 'Ø': "OE", 'Þ': "Th",
	'ß': "ss", 'æ': "ae", 'ð': "d", 'ł': "l", 'ø': "oe",
	'þ': "th", 'Œ': "OE", 'œ': "oe",
}

func RemoveAccents(b []byte) ([]byte, error) {
	mnBuf := make([]byte, len(b)*125/100)
	t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
	n, _, err := t.Transform(mnBuf, b, true)
	if err != nil {
		return nil, err
	}
	mnBuf = mnBuf[:n]
	tlBuf := bytes.NewBuffer(make([]byte, 0, len(mnBuf)*125/100))
	for i, w := 0, 0; i < len(mnBuf); i += w {
		r, width := utf8.DecodeRune(mnBuf[i:])
		if s, ok := transliterations[r]; ok {
			tlBuf.WriteString(s)
		} else {
			tlBuf.WriteRune(r)
		}
		w = width
	}
	return tlBuf.Bytes(), nil
}

func main() {
	in := "test stringß"
	fmt.Println(in)
	inBytes := []byte(in)
	outBytes, err := RemoveAccents(inBytes)
	if err != nil {
		fmt.Println(err)
	}
	out := string(outBytes)
	fmt.Println(out)
}

Output:

test stringß
test stringss

Answer 2

Score: 3

There is no answer to this question. If these conversions are a performance bottleneck in your application, you should fix them. If not: not.

Did you profile your application under realistic load, and is RemoveAccents the bottleneck? No? So why bother?
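(For instance, a minimal benchmark sketch with a made-up input string: drop it into a _test.go file next to RemoveAccents and run go test -bench=. -benchmem to see time and allocations per call.)

import "testing"

func BenchmarkRemoveAccents(b *testing.B) {
	for i := 0; i < b.N; i++ {
		RemoveAccents("Łódź über straße") // made-up sample input
	}
}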

Really: I assume one could do better (in the sense of less garbage, fewer iterations, and fewer conversions), e.g. by chaining in some "TransliterationTransformer". But I doubt it would be worth the hassle.
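For what it's worth, here is a rough sketch of that idea, reusing the transliterations table and isMn from answer 1; the transliterate type is invented for illustration, not an existing API. Implemented as a transform.Transformer and chained after the normalization steps, the whole pipeline needs only one string-to-[]byte and one []byte-to-string conversion, both inside transform.String:

import (
	"unicode/utf8"

	"golang.org/x/text/transform"
	"golang.org/x/text/unicode/norm"
)

// transliterate streams runes through the transliterations table.
type transliterate struct{ transform.NopResetter }

func (transliterate) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
	for nSrc < len(src) {
		r, width := utf8.DecodeRune(src[nSrc:])
		if r == utf8.RuneError && width == 1 && !atEOF && !utf8.FullRune(src[nSrc:]) {
			return nDst, nSrc, transform.ErrShortSrc // rune split across input chunks; ask for more
		}
		out := src[nSrc : nSrc+width] // default: pass the rune through unchanged
		if s, ok := transliterations[r]; ok {
			out = []byte(s)
		}
		if nDst+len(out) > len(dst) {
			return nDst, nSrc, transform.ErrShortDst // dst is full; the caller grows it and retries
		}
		nDst += copy(dst[nDst:], out)
		nSrc += width
	}
	return nDst, nSrc, nil
}

// RemoveAccents replaces the question's version: one conversion each way.
func RemoveAccents(s string) (string, error) {
	t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC, transliterate{})
	result, _, err := transform.String(t, s)
	return result, err
}

The ErrShortSrc/ErrShortDst handling is what lets the transformer work on chunked input; transform.String takes care of growing the destination buffer and retrying.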

Answer 3

Score: 1

There is a small overhead in converting a string to a byte slice (not an array; that's a different type): namely, allocating the space for the byte slice and copying the bytes into it.

Strings are their own type, an interpretation of a sequence of bytes; not every sequence of bytes is a useful string. Strings are also immutable. If you look at the strings package, you will see that strings get sliced a lot.

In your example you can omit the second conversion back to string. You can also iterate over the byte slice directly; note that ranging over a []byte yields bytes, so runes have to be decoded explicitly with utf8.DecodeRune, as in the loop in answer 1.
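For example, a throwaway snippet showing the two iteration styles:

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "Łódź"
	for _, r := range s { // ranging over a string yields runes
		fmt.Printf("%c ", r)
	}
	fmt.Println()
	b := []byte(s)
	for i := 0; i < len(b); { // a []byte is decoded rune by rune
		r, width := utf8.DecodeRune(b[i:])
		fmt.Printf("%c ", r)
		i += width
	}
	fmt.Println()
}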

As with every question about performance: you will probably need to measure. Is the allocation of byte slices really your bottleneck?

You can initialize your bytes.Buffer like so:

f := bytes.NewBuffer(make([]byte, 0, len(s)*2))

where you have a length of 0 and a capacity of 2x the size of your string. If you can estimate the size of your buffer, it is probably good to do that. It will save you a few reallocations of the underlying byte slice.
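If you want to check how much the preallocation saves, here is a minimal benchmark sketch (made-up input; put it in a _test.go file and run go test -bench=. -benchmem):

import (
	"bytes"
	"testing"
)

var input = "test stringß test stringß test stringß test stringß"

// BenchmarkBufferDefault lets the buffer grow on demand.
func BenchmarkBufferDefault(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var buf bytes.Buffer
		buf.WriteString(input)
		_ = buf.String()
	}
}

// BenchmarkBufferPrealloc reserves 2x the input size up front.
func BenchmarkBufferPrealloc(b *testing.B) {
	for i := 0; i < b.N; i++ {
		buf := bytes.NewBuffer(make([]byte, 0, 2*len(input)))
		buf.WriteString(input)
		_ = buf.String()
	}
}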
