golang HTML charset decoding

Question

I'm trying to decode HTML pages that are not UTF-8 encoded.

<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

Is there a library that can do that? I couldn't find one online.

P.S. Of course, I can extract the charset and decode the HTML page with goquery and iconv-go, but I'm trying not to reinvent the wheel.

Answer 1

Score: 2

The Go project maintains extension packages for this: golang.org/x/net/html/charset and golang.org/x/text/encoding.

The code below makes sure the document is decoded to UTF-8 before the html package parses it:

import (
    "bufio"
    "io"

    "golang.org/x/net/html"
    "golang.org/x/net/html/charset"
    "golang.org/x/text/encoding/htmlindex"
)

// detectContentCharset sniffs the encoding from the first 1024 bytes
// of r without advancing the reader.
func detectContentCharset(r *bufio.Reader) string {
    // On a body shorter than 1024 bytes Peek returns the available
    // bytes along with an error, so only the data is inspected here.
    if data, _ := r.Peek(1024); len(data) > 0 {
        // The best guess is used even when it is not reported as certain.
        if _, name, _ := charset.DetermineEncoding(data, ""); name != "" {
            return name
        }
    }
    return "utf-8"
}

// Decode parses the HTML body in the specified encoding and returns
// the root node of the document. If label is empty, the encoding is
// detected from the body itself.
func Decode(body io.Reader, label string) (*html.Node, error) {
    // Detection and parsing share one buffered reader, so the bytes
    // peeked during detection are still there when parsing starts.
    r := bufio.NewReader(body)
    if label == "" {
        label = detectContentCharset(r)
    }
    e, err := htmlindex.Get(label)
    if err != nil {
        return nil, err
    }

    var src io.Reader = r
    if name, _ := htmlindex.Name(e); name != "utf-8" {
        src = e.NewDecoder().Reader(r)
    }
    return html.Parse(src)
}

Answer 2

Score: 0

goquery may meet your needs. For example:

import (
    "log"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    d, err := goquery.NewDocument("http://www.google.com")
    if err != nil {
        log.Fatal(err)
    }
    dh := d.Find("head")
    dc := dh.Find("meta[http-equiv]")
    // Attr returns the attribute value and whether the attribute exists.
    c, ok := dc.Attr("content") // value that carries the charset
    log.Println(c, ok)
    // ...
}

More operations are available on the Document struct.
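
The content value read from the meta tag above still has to be split into its media type and charset parameter. A minimal sketch of that step using the standard library's mime.ParseMediaType follows. The charsetFromContent helper and its "utf-8" fallback are assumptions made for illustration and are not part of goquery.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContent is a hypothetical helper that extracts the charset
// parameter from a Content-Type style value, such as the content attribute
// of <meta http-equiv="Content-Type">. It falls back to "utf-8" (an
// assumption) when the value cannot be parsed or names no charset.
func charsetFromContent(content string) string {
	_, params, err := mime.ParseMediaType(content)
	if err != nil {
		return "utf-8"
	}
	if cs, ok := params["charset"]; ok && cs != "" {
		return strings.ToLower(cs)
	}
	return "utf-8"
}

func main() {
	fmt.Println(charsetFromContent("text/html; charset=gb2312")) // gb2312
	fmt.Println(charsetFromContent("text/html"))                 // utf-8
}
```

The resulting label could then be handed to a lookup such as htmlindex.Get from the first answer to obtain a decoder.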

  • Posted by huangapple on April 12, 2016, 12:46:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/36563805.html