英文:
Read UTF-8 encoded string from io.Reader
问题
我正在使用TCP套接字编写一个小型通信协议。
我能够读取和写入基本数据类型,如整数,但我不知道如何从字节切片中读取UTF-8编码的字符串。
协议客户端是用Java编写的,服务器是用Go编写的。
根据我的阅读:Go的符文(rune)长度为32位,UTF-8字符的长度为1到4个字节,这使得将字节切片简单地转换为字符串变得不可能。
我想知道如何读取和写入这个UTF-8流。
注意
我有读取字符串时的字节缓冲区长度。
英文:
I am writing an small communication protocol with TCP sockets.
I am able to read and write basic data types such as integers but I have no idea of how to read an UTF-8 encoded string from a slice of bytes.
The protocol client is written in Java and the server is Go.
As per I read: GO runes are 32 bit long and UTF-8 chars are 1 to 4 byte long, what makes not possible to simply cast a byte slice to a String.
I'd like to know how can I read and write this UTF-8 stream.
Note
I have the byte buffer length on time to read the string.
答案1
得分: 5
首先讲一些理论知识:
- 在Go语言中,
rune
表示一个Unicode码点,即Unicode中分配给特定字符的数字。它是uint32
的别名。 - UTF-8是一种Unicode的编码方式,用于存储和传输Unicode码点。UTF-8可能使用1到4个字节来编码一个码点。
在Go数据类型中的映射关系如下:
-
[]byte
和string
都存储一系列字节(在Go中,byte
是uint8
的别名)。主要区别在于字符串是不可变的,所以你可以这样做:
b := make([]byte, 2) b[0] = byte('a') b[1] = byte('z')
但是你不能这样做:
var s string s[0] = byte('a')
后一种情况甚至无法显式设置字符串的长度(就像想象中的
s := make(string, 10)
一样)。 -
虽然Go中的字符串包含抽象的字节(你可以自由地在其中存储使用Windows-1252编码的字符),但是某些Go语句和类型转换会将字符串解释为以UTF-8编码,特别是:
string
和[]rune
之间的类型转换将字符串解析为UTF-8编码的码点序列,并生成一个码点切片。反向类型转换从码点切片中获取Unicode码点,并生成一个UTF-8编码的字符串。- 对字符串的
range
循环遍历的是组成字符串的Unicode码点,而不仅仅是字节。
Go还提供了string
和[]byte
之间的类型转换。现在回想一下,字符串是只读的,而字节切片不是。这意味着像下面这样的结构:
b := make([]byte, 1000)
io.ReadFull(r, b)
s := string(b)
无论是将切片转换为字符串还是反过来,都会复制数据。这会浪费空间,但是它是类型安全的,并强制执行语义。
现在回到你手头的任务。
如果你处理的是相对较小的字符串,并且没有内存压力,只需将通过io.Read()
(或其他方式)填充的字节切片转换为字符串即可。确保重用用于读取数据的切片,以减轻垃圾收集器的压力 - 也就是说,不要为每次读取都分配一个新的切片,因为你将把读取代码放入其中的数据复制到一个字符串中。
最后,如果你绝对不得不不复制数据(比如处理多兆字节的字符串,并且有严格的内存要求),你可以尝试使用一些“不安全”的技巧来操作内存 - 这里是一个示例,展示了如何将字节切片的内存“移植”到字符串中。请注意,如果你采用这样的方法,你必须非常清楚地理解它可能在任何新的Go版本中失效,并且甚至不能保证完全正常工作。
英文:
Some theory first:
- A
rune
in Go represents a Unicode code point — a number assigned to a particular character in Unicode. It's an alias touint32
. - UTF-8 is a Unicode encoding — a format of representing Unicode code points for the means of storage and transmission. UTF-8 might use 1 to 4 bytes to encode a single code point.
How this maps on Go data types:
-
Both
[]byte
andstring
store a series of bytes (abyte
in Go is an alias foruint8
).The chief difference is that strings are immutable, so while you can
b := make([]byte, 2) b[0] = byte('a') b[1] = byte('z')
you can't
var s string s[0] = byte('a')
The latter fact is even underlined by the inability to set the string length explicitly (like in imaginary
s := make(string, 10)
). -
While strings in Go contain abstract bytes (you're free to store in them, say, characters encoded using Windows-1252), certain Go statements and type conversions interpret strings as being encoded in UTF-8, in particular:
- A type conversion between
string
and[]rune
parses the string as a sequence of UTF-8-encoded code points and produces a slice of them. The reverse type conversion takes the Unicode code points from the slice of runes and produces an UTF-8-encoded string. - A
range
loop over a string loops through Unicode code points comprising the string, not just bytes.
- A type conversion between
Go also supplies the type conversions between string
and []byte
and back. Now recall that strings are read-only, while slices of bytes are not. This means a construct like
b := make([]byte, 1000)
io.ReadFull(r, b)
s := string(b)
always copies the data, no matter if you convert a slice to a string or back. This wastes space but is type-safe and enforces the semantics.
Now back to your task at hand.
If you work with reasonably small strings and are not under memory pressure, just convert your byte slices filled by io.Read()
(or whatever) to strings. Be sure to reuse the slice you're using to read the data to ease the pressure on the garbage collector — that is, do not allocate a new slice for each new read as you're gonna to copy the data put to it by the reading code off to a string.
Finally, if you absolutely must to not copy the data (say, you're dealing with multi-megabyte strings, and you have tight memory requirements), you may try to play dirty tricks by unsafely working with memory — here is an example of how you might transplant the memory from a byte slice to a string. Note that should you revert to something like this, you must very well understand that it's free to break with any new release of Go, and it's not even guaranteed to work at all.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论