Equivalent to Python's HTML parsing function/module in Go?

Question

I'm teaching myself Go and am stuck on fetching and parsing HTML/XML. In Python, I usually write the following code when I do web scraping:

from urllib.request import urlopen, Request
url = "http://stackoverflow.com/"
req = Request(url)
html = urlopen(req).read()

That gives me the raw HTML/XML as either a string or bytes, and I can work with it from there. How do I do the same in Go? What I hope to get is the raw HTML data stored in either a string or a []byte (the two convert easily, so I don't mind which). I'm considering the gokogiri package for web scraping in Go (I'm not sure I'll end up using it), but it looks like it needs the raw HTML text before it can do any work...

So how can I obtain such an object?

Or is there a better way to do web scraping in Go?

Thanks.

Answer 1

Score: 2

From the Go http.Get Example:

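package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Send a GET request; res.Body streams the response.
	res, err := http.Get("http://www.google.com/robots.txt")
	if err != nil {
		log.Fatal(err)
	}
	// Read the whole body into a []byte, then close it to release the connection.
	// (io.ReadAll replaces ioutil.ReadAll, which is deprecated as of Go 1.16.)
	robots, err := io.ReadAll(res.Body)
	res.Body.Close()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s", robots)
}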

This reads the contents of http://www.google.com/robots.txt into the variable robots. Note that robots is a []byte, not a string; use string(robots) if you need a string.

For XML parsing, look into Go's encoding/xml package.
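As a rough illustration of encoding/xml, the sketch below unmarshals a small document into structs. The feed and item types, the parseFeed helper, and the sample XML are all made up for this example. Keep in mind that encoding/xml expects well-formed XML, and real-world HTML often is not; for HTML, the separate golang.org/x/net/html package may be a better fit.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
)

// item and feed are hypothetical types that mirror the sample document below.
type item struct {
	Title string `xml:"title"`
	Link  string `xml:"link"`
}

type feed struct {
	XMLName xml.Name `xml:"feed"` // requires the root element to be <feed>
	Items   []item   `xml:"item"` // collects every <item> child
}

// parseFeed (hypothetical helper) decodes raw XML bytes into a feed value.
func parseFeed(data []byte) (feed, error) {
	var f feed
	err := xml.Unmarshal(data, &f)
	return f, err
}

func main() {
	// Made-up sample data; in practice this would be the fetched response body.
	raw := []byte(`<feed>
  <item><title>First</title><link>http://example.com/1</link></item>
  <item><title>Second</title><link>http://example.com/2</link></item>
</feed>`)

	f, err := parseFeed(raw)
	if err != nil {
		log.Fatal(err)
	}
	for _, it := range f.Items {
		fmt.Printf("%s -> %s\n", it.Title, it.Link)
	}
}
```

The struct tags tell the decoder which element feeds which field, so only the parts of the document you name are extracted.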

Posted by huangapple on September 3, 2013 at 11:45:22
When reposting, please keep the link to this article: https://go.coder-hub.com/18583742.html