Read UTF-8 encoded string from io.Reader

Question

I am writing a small communication protocol over TCP sockets.
I can read and write basic data types such as integers, but I have no idea how to read a UTF-8 encoded string from a slice of bytes.

The protocol client is written in Java and the server in Go.

From what I have read, Go runes are 32 bits long and UTF-8 characters are 1 to 4 bytes long, which makes it impossible to simply cast a byte slice to a string.

I'd like to know how I can read and write this UTF-8 stream.

Note
I know the length of the byte buffer at the time I read the string.

Answer 1

Score: 5

Some theory first:

  • A rune in Go represents a Unicode code point, that is, the number assigned to a particular character in Unicode. It's an alias for int32.
  • UTF-8 is a Unicode encoding: a format for representing Unicode code points for storage and transmission. UTF-8 uses 1 to 4 bytes to encode a single code point.
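
To make the two points above concrete, here is a minimal sketch using the standard unicode/utf8 package. The helper name encodedLen and the sample characters are made up for illustration; the sketch encodes one code point from each UTF-8 length class and prints the resulting bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen (a hypothetical helper) reports how many bytes UTF-8 needs
// to encode the code point r.
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// One sample code point for each possible UTF-8 length (1 to 4 bytes).
	for _, r := range []rune{'a', 'é', '€', '😀'} {
		buf := make([]byte, utf8.UTFMax)
		n := utf8.EncodeRune(buf, r) // writes the UTF-8 form of r into buf
		fmt.Printf("%c U+%04X -> %d byte(s): % x (RuneLen=%d)\n",
			r, r, n, buf[:n], encodedLen(r))
	}
}
```

A rune is always a single 32-bit number, while its UTF-8 form varies in length; the two are different representations of the same code point.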

How this maps onto Go data types:

  • Both []byte and string store a series of bytes (a byte in Go is an alias for uint8).

    The chief difference is that strings are immutable, so while you can

      b := make([]byte, 2)
      b[0] = byte('a')
      b[1] = byte('z')
    

    you can't

      var s string
      s[0] = byte('a') // does not compile: strings are immutable
    

    This is further underlined by the fact that you cannot set a string's length explicitly (as in an imaginary s := make(string, 10)).

  • While strings in Go may hold arbitrary bytes (you're free to store in them, say, characters encoded using Windows-1252), certain Go statements and type conversions interpret strings as being encoded in UTF-8, in particular:

    • A type conversion from string to []rune parses the string as a sequence of UTF-8-encoded code points and produces a slice of them. The reverse conversion takes the Unicode code points from the slice of runes and produces a UTF-8-encoded string.
    • A range loop over a string iterates over the Unicode code points that make up the string, not over its individual bytes.
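
As a small sketch of the two UTF-8-aware constructs just listed (the helper names runeCount and rangeOffsets and the sample string are invented for this example):

```go
package main

import "fmt"

// runeCount converts s to []rune, which decodes it as UTF-8,
// and returns the number of code points.
func runeCount(s string) int {
	return len([]rune(s))
}

// rangeOffsets returns the byte offset at which each code point of s
// starts, as reported by a range loop over the string.
func rangeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "héllo" // 'é' takes two bytes in UTF-8
	fmt.Println("bytes:", len(s), "code points:", runeCount(s))
	fmt.Println("start offsets:", rangeOffsets(s))

	// The reverse conversion encodes the code points back to UTF-8.
	fmt.Println(string([]rune{'h', 'é'}) == "hé")
}
```

Note that len(s) counts bytes, so it differs from the number of code points as soon as the string contains anything outside ASCII.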

Go also supplies type conversions between string and []byte in both directions. Now recall that strings are read-only, while byte slices are not. This means a construct like

b := make([]byte, 1000)
if _, err := io.ReadFull(r, b); err != nil { // r is any io.Reader
    log.Fatal(err)
}
s := string(b) // the conversion copies the bytes

always copies the data, whether you convert a slice to a string or a string to a slice. This costs memory but is type-safe and enforces the semantics.
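
One way to observe the copy is to overwrite the slice after the conversion and check that the string is unchanged. This is only an illustrative sketch; convertThenOverwrite is a made-up name.

```go
package main

import "fmt"

// convertThenOverwrite converts b to a string, then overwrites every byte
// of b. The returned string is unaffected because the conversion copied
// the data.
func convertThenOverwrite(b []byte) string {
	s := string(b)
	for i := range b {
		b[i] = 'x'
	}
	return s
}

func main() {
	b := []byte("héllo")
	s := convertThenOverwrite(b)
	fmt.Printf("string: %q, slice now: %q\n", s, b)
}
```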

Now back to your task at hand.

If you work with reasonably small strings and are not under memory pressure, just convert the byte slices filled by the reader's Read method (or io.ReadFull, or whatever you use) to strings. A byte slice that holds valid UTF-8 converts directly with string(b); runes only come into play when you decode individual code points. Be sure to reuse the slice you read into to ease the pressure on the garbage collector: there is no need to allocate a new slice for each read, because the conversion copies the data out of the slice into the string anyway.
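
Here is a rough sketch of that approach. The framing is an assumption, since the question does not describe the wire format: each string is preceded by a 4-byte big-endian length, which is what Java's DataOutputStream.writeInt produces. The names stringReader, writeString, roundTrip and maxStringLen are invented for the example. If the Java side uses DataOutputStream.writeUTF instead, the format differs (a 2-byte length and Java's modified UTF-8), and this sketch would need adjusting.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
	"unicode/utf8"
)

// maxStringLen is an arbitrary cap so a bad length prefix cannot force
// a huge allocation.
const maxStringLen = 1 << 20

// stringReader reads length-prefixed strings and reuses one scratch
// buffer across calls.
type stringReader struct {
	r   io.Reader
	buf []byte
}

// ReadString reads a 4-byte big-endian length (assumed framing) followed
// by that many bytes of UTF-8, and returns them as a string.
func (sr *stringReader) ReadString() (string, error) {
	var n uint32
	if err := binary.Read(sr.r, binary.BigEndian, &n); err != nil {
		return "", err
	}
	if n > maxStringLen {
		return "", errors.New("string too long")
	}
	if int(n) > cap(sr.buf) {
		sr.buf = make([]byte, n) // grow only when the old buffer is too small
	}
	b := sr.buf[:n]
	if _, err := io.ReadFull(sr.r, b); err != nil {
		return "", err
	}
	if !utf8.Valid(b) {
		return "", errors.New("payload is not valid UTF-8")
	}
	return string(b), nil // the conversion copies, so sr.buf can be reused
}

// writeString writes s using the same assumed framing.
func writeString(w io.Writer, s string) error {
	if err := binary.Write(w, binary.BigEndian, uint32(len(s))); err != nil {
		return err
	}
	_, err := io.WriteString(w, s)
	return err
}

// roundTrip writes the strings to an in-memory buffer and reads them back.
func roundTrip(in ...string) ([]string, error) {
	var wire bytes.Buffer
	for _, s := range in {
		if err := writeString(&wire, s); err != nil {
			return nil, err
		}
	}
	sr := &stringReader{r: &wire}
	out := make([]string, 0, len(in))
	for range in {
		s, err := sr.ReadString()
		if err != nil {
			return nil, err
		}
		out = append(out, s)
	}
	return out, nil
}

func main() {
	out, err := roundTrip("héllo", "日本語")
	fmt.Println(out, err)
}
```

On a real connection you would pass the net.Conn (ideally wrapped in a bufio.Reader) as the io.Reader. The utf8.Valid check is optional; Go strings may hold arbitrary bytes, so it is there only to reject malformed input early.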

Finally, if you absolutely must avoid copying the data (say, you're dealing with multi-megabyte strings and have tight memory requirements), you may try to play dirty tricks by working with memory unsafely; here is an example of how you might transplant the memory from a byte slice to a string. Note that if you resort to something like this, you must understand very well that it is free to break with any new release of Go, and it is not even guaranteed to work at all.
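
For illustration only, and not necessarily the technique the answer had in mind: assuming Go 1.20 or newer, the unsafe package provides unsafe.String and unsafe.SliceData, which let a string share a slice's memory. The helper name bytesToStringNoCopy is made up. The slice must never be modified afterwards, or the supposedly immutable string changes underneath you.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToStringNoCopy returns a string that shares b's memory instead of
// copying it. Assumes Go 1.20+. The caller must not modify b afterwards.
func bytesToStringNoCopy(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(unsafe.SliceData(b), len(b))
}

func main() {
	b := []byte("no copy")
	s := bytesToStringNoCopy(b)
	// Compare the data pointers to see whether the memory is shared.
	fmt.Println(s, unsafe.StringData(s) == unsafe.SliceData(b))
}
```

This also means a reusable read buffer cannot be combined with this trick: each string needs a slice that is left alone for as long as the string is in use.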

huangapple
  • Posted on November 25, 2013, 22:07:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/20195145.html