golang HTML charset decoding

Question

I'm trying to decode HTML pages that are not UTF-8 encoded:

    <meta http-equiv="Content-Type" content="text/html; charset=gb2312">

Is there a library that can do that? I couldn't find one online.

P.S. Of course, I could extract the charset and decode the HTML page with goquery and iconv-go, but I'm trying not to reinvent the wheel.

Answer 1

Score: 2

The Go project provides extension packages for this: golang.org/x/net/html/charset and golang.org/x/text/encoding.

The code below makes sure the document can be parsed correctly by the html package:

    import (
        "bufio"
        "io"

        "golang.org/x/net/html"
        "golang.org/x/net/html/charset"
        "golang.org/x/text/encoding/htmlindex"
    )

    // detectContentCharset sniffs the encoding from the first 1024 bytes
    // of r without consuming them.
    func detectContentCharset(r *bufio.Reader) string {
        // Peek still returns the available bytes when the body is shorter
        // than 1024 bytes (err is then io.EOF), so only empty data is skipped.
        if data, _ := r.Peek(1024); len(data) > 0 {
            // DetermineEncoding reports certain == false for an encoding
            // found in a <meta> tag, so the flag is not used as a filter.
            if _, name, _ := charset.DetermineEncoding(data, ""); name != "" {
                return name
            }
        }
        return "utf-8"
    }

    // Decode parses the HTML body using the given charset label (sniffed
    // from the body when empty) and returns the root node of the document.
    func Decode(body io.Reader, label string) (*html.Node, error) {
        // Wrap the body once and keep reading from the same bufio.Reader,
        // otherwise the peeked bytes would be lost to the parser.
        r := bufio.NewReader(body)
        if label == "" {
            label = detectContentCharset(r)
        }
        e, err := htmlindex.Get(label)
        if err != nil {
            return nil, err
        }
        var src io.Reader = r
        if name, _ := htmlindex.Name(e); name != "utf-8" {
            src = e.NewDecoder().Reader(r)
        }
        return html.Parse(src)
    }
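One detail worth care when sniffing the encoding: bytes peeked through a bufio.Reader are only preserved if every later read goes through that same bufio.Reader. The sketch below uses only the standard library to show the pattern; peekPrefix and readAllAfterPeek are hypothetical helper names introduced for illustration, not part of any package mentioned above.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// peekPrefix (hypothetical helper) returns up to n leading bytes of r
// together with a reader that still yields the whole stream, prefix included.
func peekPrefix(r io.Reader, n int) ([]byte, io.Reader) {
	br := bufio.NewReaderSize(r, n)
	// Peek reports io.EOF when fewer than n bytes exist but still returns
	// the bytes it did read, so the error is deliberately ignored here.
	prefix, _ := br.Peek(n)
	return prefix, br
}

// readAllAfterPeek peeks at n bytes of s and then reads the full stream back.
func readAllAfterPeek(s string, n int) (prefix, all string) {
	p, r := peekPrefix(strings.NewReader(s), n)
	prefix = string(p) // copy: the peeked slice is only valid until the next read
	b, _ := io.ReadAll(r)
	return prefix, string(b)
}

func main() {
	prefix, all := readAllAfterPeek(`<meta charset="gb2312"><p>hi</p>`, 16)
	fmt.Printf("peeked: %q\nstream: %q\n", prefix, all)
}
```

The returned reader, not the original one, is what should be handed to the decoder or parser afterwards.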

Answer 2

Score: 0

goquery may meet your needs. For example:

    import (
        "log"

        "github.com/PuerkitoBio/goquery"
    )

    func main() {
        d, err := goquery.NewDocument("http://www.google.com")
        if err != nil {
            log.Fatal(err)
        }
        // Attr returns the attribute value and whether the attribute exists.
        c, ok := d.Find("head").Find("meta[http-equiv]").Attr("content")
        if !ok {
            log.Fatal("no content attribute found")
        }
        log.Println(c) // get the charset from this value
    }

More operations are available on the Document struct.
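The content value read from the meta tag (for instance "text/html; charset=gb2312") still has to be reduced to a charset label before it can be used to pick a decoder. A minimal sketch of that step, using only the standard library's mime package; charsetFromContentType is a hypothetical helper name, not a goquery function.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType (hypothetical helper) returns the lowercased
// charset parameter of a Content-Type value, or "" if there is none.
func charsetFromContentType(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.ToLower(strings.TrimSpace(params["charset"]))
}

func main() {
	fmt.Println(charsetFromContentType("text/html; charset=gb2312"))
}
```

The resulting label can then be passed to an encoding lookup such as htmlindex.Get from golang.org/x/text/encoding/htmlindex.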

huangapple
  • Posted on 2016-04-12 12:46:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/36563805.html