Parsing list items from HTML with Go

Question

I want to extract all list items (the content of each <li></li>) with Go. Should I use a regexp to get the <li> items, or is there a library for this?

My intention is to get a list or array in Go that contains every list item from a specific HTML web page. How should I do that?

Answer 1

Score: 1

You likely want to use the golang.org/x/net/html package. It is not in the Go standard library but in the Go sub-repositories. (The sub-repositories are part of the Go project but outside the main Go tree. They are developed under looser compatibility requirements than the Go core.)

There is an example in that package's documentation that may be similar to what you want.

If you need to stick with the Go standard library for some reason, then for "typical HTML" you can use encoding/xml. Its Decoder has a non-strict mode intended for this: set Strict to false, AutoClose to xml.HTMLAutoClose, and Entity to xml.HTMLEntity.

Both packages take an io.Reader as input. If you have a string or []byte variable, you can wrap it with strings.NewReader or bytes.NewReader (or bytes.NewBuffer) to get an io.Reader.

For HTML, the input is more likely to come from an http.Response body (make sure to close it when done). Perhaps something like:

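    import (
        "fmt"
        "net/http"

        "golang.org/x/net/html"
    )

    // printLinks (an illustrative name) fetches someURL and prints the
    // href of every <a> element, as in the package documentation example.
    // For the question's list items, test n.Data == "li" instead.
    func printLinks(someURL string) error {
        resp, err := http.Get(someURL)
        if err != nil {
            return err
        }
        defer resp.Body.Close()

        doc, err := html.Parse(resp.Body)
        if err != nil {
            return err
        }
        // Recursively visit nodes in the parse tree.
        var f func(*html.Node)
        f = func(n *html.Node) {
            if n.Type == html.ElementNode && n.Data == "a" {
                for _, a := range n.Attr {
                    if a.Key == "href" {
                        fmt.Println(a.Val)
                        break
                    }
                }
            }
            for c := n.FirstChild; c != nil; c = c.NextSibling {
                f(c)
            }
        }
        f(doc)
        return nil
    }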

Of course, parsing the fetched HTML will not see content that a page adds or changes with client-side JavaScript.

Answer 2

Score: 0

Here's one way I found to solve this.

If you're trying to extract the text inside an li element, first find the li start tag, then advance the tokenizer to the very next token, which will (hopefully) be the text. You may need some extra logic if the next token is an anchor, span, etc.

    resp, err := http.Get(url)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    z := html.NewTokenizer(bufio.NewReader(resp.Body))
    for {
        tt := z.Next()
        switch tt {
        case html.ErrorToken:
            // End of input (or a read error): stop.
            return
        case html.StartTagToken:
            t := z.Token()
            switch t.Data {
            case "li":
                // Advance to the token right after <li>, hopefully its text.
                z.Next()
                t = z.Token()
                fmt.Println(t.Data)
            }
        }
    }

But really, you should just use github.com/PuerkitoBio/goquery.

huangapple
  • Posted on March 28, 2015, 23:05:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/29318672.html