2015-03-28 23:05:35 · go

Parsing list items from html with Go

Question

I want to extract all list items (the content of each <li></li>) from an HTML page with Go. Should I use regexp to get the <li> items, or is there a library for this?

My intention is to get a list or array in Go that contains all the list items from a specific HTML web page. How should I do that?

Answer 1

Score: 1

You likely want to use the golang.org/x/net/html package.
It's not in the Go standard packages, but instead in the Go Sub-repositories. (The sub-repositories are part of the Go Project but outside the main Go tree. They are developed under looser compatibility requirements than the Go core.)

There is an example in that documentation that may be similar to what you want.

If you need to stick with the Go standard packages for some reason, then
for "typical HTML" you can use encoding/xml.
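As a rough sketch of the standard-library route, the following uses encoding/xml with its decoder relaxed for HTML (Strict = false, xml.HTMLAutoClose, xml.HTMLEntity). The names listItems and listItemsFromString are made up for this example. It assumes reasonably well-formed markup; pages with unclosed tags or inline scripts may still make the decoder return an error.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"log"
	"strings"
)

// listItems returns the trimmed text of every top-level <li> read from r.
// Text inside child tags (<a>, <span>, nested lists) is folded into the item.
func listItems(r io.Reader) ([]string, error) {
	dec := xml.NewDecoder(r)
	// Relax the decoder so it tolerates common HTML habits.
	dec.Strict = false
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	var items []string
	var cur strings.Builder
	depth := 0 // number of <li> elements currently open
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return items, nil
		}
		if err != nil {
			return items, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == "li" {
				if depth == 0 {
					cur.Reset()
				}
				depth++
			}
		case xml.EndElement:
			if t.Name.Local == "li" {
				depth--
				if depth == 0 {
					items = append(items, strings.TrimSpace(cur.String()))
				}
			}
		case xml.CharData:
			if depth > 0 {
				cur.Write(t) // copy now; the bytes are reused by the decoder
			}
		}
	}
}

// listItemsFromString is a convenience wrapper for in-memory HTML.
func listItemsFromString(src string) ([]string, error) {
	return listItems(strings.NewReader(src))
}

func main() {
	items, err := listItemsFromString(`<ul><li>one</li><li>two <a href="#">link</a></li></ul>`)
	if err != nil {
		log.Fatal(err)
	}
	for _, it := range items {
		fmt.Println(it)
	}
}
```

For pages found in the wild, a real HTML parser is usually the more forgiving choice.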

Both packages tend to use an io.Reader for input. If you have a string or []byte variable, you can wrap it with strings.NewReader or bytes.NewReader (or a bytes.Buffer) to get an io.Reader.
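To illustrate the wrapping, here is a small sketch. readAll is a hypothetical stand-in for any function that accepts an io.Reader (html.Parse and xml.NewDecoder both do); fromString and fromBytes are likewise illustrative names.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"strings"
)

// readAll drains r into a string. It stands in for a parser that takes
// an io.Reader as its input.
func readAll(r io.Reader) (string, error) {
	var sb strings.Builder
	_, err := io.Copy(&sb, r)
	return sb.String(), err
}

// fromString wraps a string in an io.Reader.
func fromString(s string) (string, error) {
	return readAll(strings.NewReader(s))
}

// fromBytes wraps a []byte in an io.Reader without copying it.
func fromBytes(b []byte) (string, error) {
	return readAll(bytes.NewReader(b))
}

func main() {
	a, err := fromString("<li>from a string</li>")
	if err != nil {
		log.Fatal(err)
	}
	b, err := fromBytes([]byte("<li>from bytes</li>"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(a)
	fmt.Println(b)
}
```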

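For HTML it's more likely that your input is an http.Response body (make sure to close it when done).
Perhaps something like the following, which prints every link on the page; matching on "li" instead of "a" finds the list items: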

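    package main

    import (
    	"fmt"
    	"log"
    	"net/http"

    	"golang.org/x/net/html"
    )

    // printLinks fetches someURL and prints the href of every <a> element.
    func printLinks(someURL string) error {
    	resp, err := http.Get(someURL)
    	if err != nil {
    		return err
    	}
    	defer resp.Body.Close()
    	doc, err := html.Parse(resp.Body)
    	if err != nil {
    		return err
    	}
    	// Recursively visit nodes in the parse tree.
    	var f func(*html.Node)
    	f = func(n *html.Node) {
    		if n.Type == html.ElementNode && n.Data == "a" {
    			for _, a := range n.Attr {
    				if a.Key == "href" {
    					fmt.Println(a.Val)
    					break
    				}
    			}
    		}
    		for c := n.FirstChild; c != nil; c = c.NextSibling {
    			f(c)
    		}
    	}
    	f(doc)
    	return nil
    }

    func main() {
    	// Placeholder URL; replace it with the page you want to read.
    	if err := printLinks("https://example.com/"); err != nil {
    		log.Fatal(err)
    	}
    }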

Of course, parsing a fetched page only sees the HTML the server sent; it won't work for pages that build or modify their contents with JavaScript on the client side.

Answer 2

Score: 0

Here's one way I found to solve this.

If you're trying to extract the text after the li element, you first find the li start tag and then advance the tokenizer to the very next token, which will (hopefully) be the text. You may need some extra logic if the next token is an anchor, span, etc.

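    package main

    import (
    	"bufio"
    	"fmt"
    	"log"
    	"net/http"

    	"golang.org/x/net/html"
    )

    // printListItems prints the token that directly follows each <li> start tag.
    func printListItems(url string) {
    	resp, err := http.Get(url)
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer resp.Body.Close()
    	z := html.NewTokenizer(bufio.NewReader(resp.Body))
    	for {
    		tt := z.Next()
    		switch tt {
    		case html.ErrorToken:
    			return // end of input or a read error
    		case html.StartTagToken:
    			t := z.Token()
    			switch t.Data {
    			case "li":
    				z.Next()
    				t = z.Token()
    				fmt.Println(t.Data)
    			}
    		}
    	}
    }

    func main() {
    	printListItems("https://example.com/") // placeholder URL
    }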

But really, you should just use github.com/PuerkitoBio/goquery.
