Ignore illegal bytes when decoding text with Go?

Question

I'm converting a Go program that decodes email messages. It currently runs iconv to do the actual decoding, which of course has overhead. I would like to use the golang.org/x/text/transform and golang.org/x/net/html/charset packages to do this. Here is working code:

// cs is the charset that the email body is encoded with, pulled from
// the Content-Type declaration.
enc, _ := charset.Lookup(cs) // the second result is the canonical charset name, unused here
if enc == nil {
	log.Fatalf("Can't find %s", cs)
}
// body is the email body we're converting to utf-8
r := transform.NewReader(strings.NewReader(body), enc.NewDecoder())

// result contains the converted-to-utf8 email body
result, err := ioutil.ReadAll(r)

That works great except for when it encounters illegal bytes, which unfortunately is not an uncommon experience when dealing with email in the wild. ioutil.ReadAll() returns an error and all the converted bytes up until the problem. Is there a way to tell the transform package to ignore illegal bytes? Right now, we use the -c flag to iconv to do that. I've gone through the docs for the transform package, and I can't tell if it's possible or not.

UPDATE:
Here's a test program that shows the problem (the Go playground doesn't have the charset or transform packages...). The raw text was taken from an actual email. Yep, it's in English, and yep, the charset in the email was set to EUC-KR. I need it to ignore that apostrophe.

package main

import (
    "io/ioutil"
    "log"
    "strings"

    "golang.org/x/net/html/charset"
    "golang.org/x/text/transform"
)

func main() {
    raw := `So, at 64 kBps, or kilobits per second, you’re getting 8 kilobytes a second.`
    enc, _ := charset.Lookup("euc-kr")
    r := transform.NewReader(strings.NewReader(raw), enc.NewDecoder())
    result, err := ioutil.ReadAll(r)
    if err != nil {
        log.Printf("ReadAll returned %s", err)
    }
    log.Printf("RESULT: '%s'", string(result))
}

Answer 1

Score: 4

enc.NewDecoder() returns a decoder that implements transform.Transformer. The documentation for NewDecoder() says:

> Transforming source bytes that are not of that encoding will not result in an error per se. Each byte that cannot be transcoded will be represented in the output by the UTF-8 encoding of '\uFFFD', the replacement rune.

This tells us it is the reader failing on the replacement rune (also known as the error rune). Fortunately it is easy to strip those out.

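golang.org/x/text/transform provides two helper functions we can use to solve this problem. Chain() takes a set of transformers and chains them together. RemoveFunc() takes a function and filters out every rune (not byte) for which it returns true. Note that RemoveFunc() has since been deprecated in favor of runes.Remove() from golang.org/x/text/runes.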

Something like the following (untested) should work:

// Requires the "unicode/utf8" import for utf8.RuneError (U+FFFD).
filter := transform.Chain(enc.NewDecoder(), transform.RemoveFunc(func(r rune) bool {
    return r == utf8.RuneError
}))
r := transform.NewReader(strings.NewReader(body), filter)

That should filter out all replacement runes before they reach the reader and cause an error.
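To see what the predicate in that filter does on its own, here is a minimal standard-library sketch that applies the same test (r == utf8.RuneError) to text that has already been decoded. The helper name stripReplacement and the sample string are illustrative assumptions, not part of golang.org/x/text. Note that this approach also drops any U+FFFD that was legitimately present in the original message.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// stripReplacement is a hypothetical helper (not part of any library).
// It returns s with every U+FFFD replacement rune removed.
func stripReplacement(s string) string {
	return strings.Map(func(r rune) rune {
		if r == utf8.RuneError {
			return -1 // a negative result tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	// Simulated decoder output: undecodable input was replaced by U+FFFD.
	decoded := "you\uFFFD\uFFFDre getting 8 kilobytes a second."
	fmt.Println(stripReplacement(decoded))
}
```

Doing the removal inside a transform chain, as in the answer, avoids a second pass over the decoded text; the standalone version is only meant to make the filtering rule easy to check.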

Answer 2

Score: 0

Here is the solution I went with. Instead of using a Reader, I allocate the destination buffer by hand and call the Transform() function directly. When Transform() errors out, I check for a short destination buffer, and reallocate if necessary. Otherwise I skip a rune, assuming that it is the illegal character. For completeness, I should also check for a short input buffer, but I do not do so in this example.

raw := `So, at 64 kBps, or kilobits per second, you’re getting 8 kilobytes a second.`
enc, _ := charset.Lookup("euc-kr")
dst := make([]byte, len(raw))
d := enc.NewDecoder()

var (
    in  int
    out int
)
for in < len(raw) {
    // Do the transformation
    ndst, nsrc, err := d.Transform(dst[out:], []byte(raw[in:]), true)
    in += nsrc
    out += ndst
    if err == nil {
        // Completed transformation
        break
    }
    if err == transform.ErrShortDst {
        // Our output buffer is too small, so we need to grow it
        log.Printf("Short")
        t := make([]byte, (cap(dst)+1)*2)
        copy(t, dst)
        dst = t
        continue
    }
    // We're here because of at least one illegal character. Skip over the current rune
    // and try again.
    _, width := utf8.DecodeRuneInString(raw[in:])
    in += width
}
// The converted text is dst[:out].
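The control flow of that loop can be tried without the x/text packages. The sketch below is an assumption-laden stand-in: the transformer interface, errShortDst, errInvalid, and the toy asciiOnly type are all hypothetical and only mimic the shape of transform.Transformer and transform.ErrShortDst. It shows the same three cases: finish on success, grow the buffer on a short destination, and skip one rune of input on any other error.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// Stand-ins for the golang.org/x/text/transform types so the sketch runs
// with only the standard library. They are assumptions, not the real API.
var (
	errShortDst = errors.New("short destination buffer")
	errInvalid  = errors.New("invalid input byte")
)

type transformer interface {
	Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)
}

// asciiOnly is a toy transformer: it copies ASCII bytes and fails on
// anything else, which plays the role of an illegal byte sequence.
type asciiOnly struct{}

func (asciiOnly) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
	for nSrc < len(src) {
		if src[nSrc] >= utf8.RuneSelf {
			return nDst, nSrc, errInvalid
		}
		if nDst == len(dst) {
			return nDst, nSrc, errShortDst
		}
		dst[nDst] = src[nSrc]
		nDst++
		nSrc++
	}
	return nDst, nSrc, nil
}

// decodeSkipping runs t over src, growing the output buffer on
// errShortDst and skipping one rune of input on any other error.
func decodeSkipping(t transformer, src []byte) []byte {
	dst := make([]byte, 8) // deliberately small to exercise the grow path
	in, out := 0, 0
	for in < len(src) {
		nDst, nSrc, err := t.Transform(dst[out:], src[in:], true)
		in += nSrc
		out += nDst
		switch {
		case err == nil:
			return dst[:out]
		case errors.Is(err, errShortDst):
			grown := make([]byte, (len(dst)+1)*2)
			copy(grown, dst[:out])
			dst = grown
		default:
			// Skip the offending rune (or a single byte if it is not valid UTF-8).
			_, width := utf8.DecodeRune(src[in:])
			in += width
		}
	}
	return dst[:out]
}

func main() {
	raw := "So, at 64 kBps, you’re getting 8 kilobytes a second."
	fmt.Printf("%s\n", decodeSkipping(asciiOnly{}, []byte(raw)))
}
```

Returning dst[:out] rather than dst matters here, because the buffer is usually larger than the converted text.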

huangapple
  • Posted on September 11, 2015 at 06:15:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/32512500.html