Go encoding transform issue

Question

I have the following code in Go:

    import (
        "log"
        "net/http"
        "code.google.com/p/go.text/transform"
        "code.google.com/p/go.text/encoding/charmap"
    )

    ...

    res, err := http.Get(url)
    if err != nil {
        log.Println("Cannot read", url)
        log.Println(err)
        continue
    }
    defer res.Body.Close()

The page I load contains non-UTF-8 characters, so I try to use `transform`:

    utfBody := transform.NewReader(res.Body, charmap.Windows1251.NewDecoder())

But the problem is that it returns an error even in this simple scenario:

    bytes, err := ioutil.ReadAll(utfBody)
    log.Println(err)
    if err == nil {
        log.Println(bytes)
    }

The error is:

`transform: short destination buffer`

It does actually fill `bytes` with some data, but in my real code I use `goquery`:

    doc, err := goquery.NewDocumentFromReader(utfBody)

which sees the error and fails, returning no data.

I tried passing "chunks" of `res.Body` to `transform.NewReader` and figured out that as long as `res.Body` contains no non-UTF-8 data, it works well. When it contains a non-UTF-8 byte, it fails with the error above.

I'm quite new to Go and don't really understand what's going on or how to deal with this.


# Answer 1
**Score**: 7

Without the complete code and an example URL, it's hard to tell exactly what is going wrong here.

That said, I can recommend the [`golang.org/x/net/html/charset`](https://godoc.org/golang.org/x/net/html/charset) package for this, since it supports both charset guessing and conversion to UTF-8.

```go
import (
    "io/ioutil"
    "net/http"

    "golang.org/x/net/html/charset"
)

// fetchUtf8Bytes downloads url and returns its body converted to UTF-8.
func fetchUtf8Bytes(url string) ([]byte, error) {
    res, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    defer res.Body.Close()

    contentType := res.Header.Get("Content-Type") // Optional, better guessing
    utf8reader, err := charset.NewReader(res.Body, contentType)
    if err != nil {
        return nil, err
    }

    return ioutil.ReadAll(utf8reader)
}
```

Complete example: http://play.golang.org/p/olcBM9ughv
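As a side note on why the `Content-Type` header helps with guessing: servers often declare the encoding in a `charset` parameter of that header. The sketch below shows one way to inspect that parameter with the standard library's `mime.ParseMediaType`. `charsetFromContentType` is a hypothetical helper written for this example, and I am not claiming this is how `charset.NewReader` handles the header internally.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType extracts the charset parameter from a Content-Type
// header value. It returns "" when the header is empty, malformed, or has
// no charset parameter.
func charsetFromContentType(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

func main() {
	headers := []string{
		"text/html; charset=windows-1251",
		"text/html",
		"",
	}
	for _, h := range headers {
		fmt.Printf("%q -> %q\n", h, charsetFromContentType(h))
	}
}
```

When the header carries no charset, a detector has to fall back on other hints such as the page content, which is why passing the header is optional but useful.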

Posted by huangapple on July 27, 2015 at 19:03:00
Source: https://go.coder-hub.com/31651410.html