golang HTML charset decoding
Question
I'm trying to decode HTML pages that are NOT utf-8 encoded.
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
Is there any library that can do that? I couldn't find one online.
P.S. Of course, I could extract the charset and decode the HTML page myself with goquery and iconv-go, but I'd rather not reinvent the wheel.
Answer 1
Score: 2
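The Go project maintains extension packages for this: golang.org/x/net/html/charset for detecting the encoding, and golang.org/x/text/encoding (with its htmlindex subpackage) for decoding it.
The code below detects the encoding when none is given and decodes the body so that the html package can parse the document correctly: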
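import (
	"bufio"
	"io"

	"golang.org/x/net/html"
	"golang.org/x/net/html/charset"
	"golang.org/x/text/encoding/htmlindex"
)

// detectContentCharset sniffs up to the first 1024 bytes of r without
// consuming them and returns the detected encoding name.
func detectContentCharset(r *bufio.Reader) string {
	// Peek returns fewer bytes plus an error for short documents; use what is there.
	data, _ := r.Peek(1024)
	// The third result reports whether the detection is certain; guesses are ignored here.
	if _, name, certain := charset.DetermineEncoding(data, ""); certain {
		return name
	}
	return "utf-8"
}

// Decode parses the HTML body in the specified encoding and returns the
// root node of the document. An empty label means "sniff the encoding".
func Decode(body io.Reader, label string) (*html.Node, error) {
	// Sniff and parse through the same buffered reader, so the peeked
	// bytes are still available to the parser afterwards.
	r := bufio.NewReader(body)
	if label == "" {
		label = detectContentCharset(r)
	}
	e, err := htmlindex.Get(label)
	if err != nil {
		return nil, err
	}
	var in io.Reader = r
	if name, _ := htmlindex.Name(e); name != "utf-8" {
		in = e.NewDecoder().Reader(r)
	}
	return html.Parse(in)
}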
Answer 2
Score: 0
goquery may meet your needs. For example:
import (
	"log"
	"github.com/PuerkitoBio/goquery"
)

func main() {
	d, err := goquery.NewDocument("http://www.google.com")
	if err != nil {
		log.Fatal(err)
	}
	dc := d.Find("head").Find("meta[http-equiv]")
	// Attr returns the whole attribute value, e.g. "text/html; charset=gb2312", and a found flag.
	c, ok := dc.Attr("content")
	log.Println(c, ok)
	// ...
}
More operations are available on the Document struct.
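The content attribute holds the full value (for instance text/html; charset=gb2312), so the charset label still has to be pulled out of it before it can be handed to a decoder. A minimal sketch using only the standard library's mime.ParseMediaType; the helper name charsetFromContent is hypothetical:

```go
package main

import (
	"fmt"
	"mime"
)

// charsetFromContent (hypothetical helper) extracts the charset parameter
// from a Content-Type style value. It returns "" when no charset is
// declared or the value cannot be parsed.
func charsetFromContent(content string) string {
	_, params, err := mime.ParseMediaType(content)
	if err != nil {
		return ""
	}
	return params["charset"]
}

func main() {
	fmt.Println(charsetFromContent("text/html; charset=gb2312")) // prints gb2312
}
```

ParseMediaType lowercases parameter names but keeps the case of values, so a label such as UTF-8 comes back unchanged.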