bytes to string conversion with invalid characters

Question

I need to parse UDP packets that may be invalid or contain errors. To display the content of the packets, I would like to replace invalid characters with . after a byte-to-string conversion.

How can I do it? This is my code:

func main() {
   a := []byte{'a', 0xff, 0xaf, 'b', 0xbf}
   s := string(a)
   s = strings.Replace(s, string(0xFFFD), ".", 0)

   fmt.Println("s: ", s) // I would like to display "a..b."
   for _, r := range s {
      fmt.Println("r: ", r)
   }
   rs := []rune(s)
   fmt.Println("rs: ", rs)
}

Answer 1

Score: 5

The root problem with your approach is that the result of converting a []byte to a string does not contain any U+FFFD: the conversion only copies bytes from the source to the destination, verbatim.
Just like byte slices, strings in Go are not obliged to contain UTF-8-encoded text; they can contain any data, including opaque binary data that has nothing to do with text.

But some operations on strings, namely converting them to []rune and iterating over them with range, do interpret strings as UTF-8-encoded text.
That is precisely where you got tripped up: your range debugging loop attempted to interpret the string, and each time decoding a properly encoded code point failed, range yielded a replacement character, U+FFFD.
To reiterate, the string obtained by the type conversion does not contain the characters you wanted strings.Replace to replace.
(Separately, passing 0 as the last argument of strings.Replace requests zero replacements; pass -1 for no limit.)

As to how to actually make a valid UTF-8-encoded string out of your data, you might employ a two-step process:

  1. Type-convert your byte slice to a string, as you already do.
  2. Iterate over the string using any construct that interprets it as UTF-8, replacing each U+FFFD that appears during decoding.

Something like this:

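// replaceInvalid (an illustrative name) returns b as a string in which
// every byte that is not part of valid UTF-8 is replaced by '.'.
// Note: a genuinely encoded U+FFFD in the input is replaced as well.
func replaceInvalid(b []byte) string {
  var sb strings.Builder
  for _, c := range string(b) {
    if c == '\uFFFD' {
      sb.WriteByte('.')
    } else {
      sb.WriteRune(c)
    }
  }
  return sb.String()
}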

A note on performance: since converting a []byte to a string copies memory (strings are immutable while slices are not), the conversion in the first step might be a waste of resources for code dealing with large chunks of data and/or working in tight processing loops.
In this case, it may be worth using the DecodeRune function of the unicode/utf8 package, which works on byte slices.
An example from its docs can easily be adapted to work with the loop above.
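
As a rough sketch of that adaptation (the function name sanitizeBytes is an assumption, not something from the answer), the loop below decodes directly from the byte slice with utf8.DecodeRune. Checking the returned size also distinguishes a decoding failure (size 1) from a U+FFFD that was genuinely present in the input (size 3), which the range-based loop cannot do.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitizeBytes (hypothetical helper) builds a string from b, writing '.'
// for every byte that cannot be decoded as UTF-8.
func sanitizeBytes(b []byte) string {
	var sb strings.Builder
	sb.Grow(len(b))
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		if r == utf8.RuneError && size == 1 {
			// Invalid encoding: DecodeRune consumed exactly one byte.
			sb.WriteByte('.')
		} else {
			sb.WriteRune(r)
		}
		b = b[size:]
	}
	return sb.String()
}

func main() {
	fmt.Println(sanitizeBytes([]byte{'a', 0xff, 0xaf, 'b', 0xbf})) // a..b.
}
```

The result is still built in a strings.Builder, so one allocation for the output remains; only the intermediate copy made by the conversion is avoided.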

See also: Remove invalid UTF-8 characters from a string

Answer 2

Score: 5

@kostix's answer is correct and explains very clearly the issue with scanning Unicode runes from a string.

Just adding the following remark: if your intention is to view only characters in the printable ASCII range (32 to 126) and you don't really care about other Unicode code points, you can be more blunt:

// create a byte slice with the same byte length as s
var bs = make([]byte, len(s))

// scan s byte by byte:
for i := 0; i < len(s); i++ {
    switch {
    case 32 <= s[i] && s[i] <= 126:
        bs[i] = s[i]

    // depending on your needs, you may also keep characters in the 0..31 range,
    // like 'tab' (9), 'linefeed' (10) or 'carriage return' (13):
    // case s[i] == 9, s[i] == 10, s[i] == 13:
    //   bs[i] = s[i]

    default:
        bs[i] = '.'
    }
}

fmt.Printf("rs: %s\n", bs)

This will give you something close to the "text" part of hexdump -C.
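
For completeness, here is the same idea as a self-contained program. It is only a sketch: the function name printableASCII is an assumption, and it takes the packet bytes directly instead of a string.

```go
package main

import "fmt"

// printableASCII (hypothetical helper) returns a string of the same length
// as b in which every byte outside the printable ASCII range 32..126 is
// replaced by '.'.
func printableASCII(b []byte) string {
	out := make([]byte, len(b))
	for i, c := range b {
		if c >= 32 && c <= 126 {
			out[i] = c
		} else {
			out[i] = '.'
		}
	}
	return string(out)
}

func main() {
	fmt.Println(printableASCII([]byte{'a', 0xff, 0xaf, 'b', 0xbf})) // a..b.
	fmt.Println(printableASCII([]byte("tab\there")))                // tab.here
}
```

Because the output always has one character per input byte, multi-byte UTF-8 characters such as é show up as two dots; this is the trade-off of the byte-wise approach.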

Answer 3

Score: 5

You may want to use strings.ToValidUTF8() for this:

> ToValidUTF8 returns a copy of the string s with each run of invalid UTF-8 byte sequences replaced by the replacement string, which may be empty.

It "seemingly" does exactly what you need. Testing it:

a := []byte{'a', 0xff, 0xaf, 'b', 0xbf}
s := strings.ToValidUTF8(string(a), ".")
fmt.Println(s)

Output:

a.b.

I wrote "seemingly" because, as you can see, there's a single dot between a and b: the two invalid bytes 0xff and 0xaf form one run of invalid sequences, and each run is replaced only once.
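
To make the run-collapsing behaviour concrete, here is a small sketch; the wrapper name dots and the sample inputs are made up for illustration. Inputs with one, two, or three adjacent invalid bytes all produce the same single dot, so the output length no longer reflects the packet length.

```go
package main

import (
	"fmt"
	"strings"
)

// dots (hypothetical wrapper) replaces each run of invalid UTF-8 in s
// with a single '.'.
func dots(s string) string {
	return strings.ToValidUTF8(s, ".")
}

func main() {
	fmt.Println(dots("a\xffb"))         // a.b
	fmt.Println(dots("a\xff\xafb"))     // a.b
	fmt.Println(dots("a\xff\xaf\xfeb")) // a.b
	// Invalid bytes separated by valid text are separate runes of output.
	fmt.Println(dots("a\xffb\xff")) // a.b.
}
```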

Note that you may avoid the []byte => string conversion, because there's a bytes.ToValidUTF8() equivalent that operates on and returns a []byte:

a := []byte{'a', 0xff, 0xaf, 'b', 0xbf}
a = bytes.ToValidUTF8(a, []byte{'.'})
fmt.Println(string(a))

The output will be the same.

If it bothers you that multiple bytes of an invalid sequence may be shrunk into a single dot, read on.

Also note that to inspect arbitrary byte slices that may or may not contain text, you may simply use hex.Dump() from the encoding/hex package, which generates output like this:

a := []byte{'a', 0xff, 0xaf, 'b', 0xbf}
fmt.Println(hex.Dump(a))

Output:

00000000  61 ff af 62 bf                                    |a..b.|

There's your expected output a..b. with other (useful) data like the hex offset and hex representation of bytes.

To get a "better" picture of the output, try it with a little longer input:

a = []byte{'a', 0xff, 0xaf, 'b', 0xbf, 50: 0xff}
fmt.Println(hex.Dump(a))

Output:

00000000  61 ff af 62 bf 00 00 00  00 00 00 00 00 00 00 00  |a..b............|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000030  00 00 ff                                          |...|

Posted by huangapple on January 11, 2022 at 16:52:42
Link to this article: https://go.coder-hub.com/70663957.html