html.Parse function returns nil instead of parsed HTML

Question

I've started learning Go and am trying to run this program, but html.Parse from golang.org/x/net/html seems to return nil when I try to get the parsed HTML. I've tried different things but can't figure out what's going on, so I'd appreciate it if someone could explain what happens under the hood. Thanks.

package main

import (
	"fmt"
	"os"
	"golang.org/x/net/html"
)

func main() {
	doc, err := html.Parse(os.Stdin)
	if err != nil {
		fmt.Fprintf(os.Stderr, "findlinks1: %v\n", err)
		os.Exit(1)
	}
	fmt.Println(doc)
	for _, link := range visit(nil, doc) {
		fmt.Printf("link is %v", link)
		fmt.Println(link)
	}
}

func visit(links []string, n *html.Node) []string {
	if n.Type == html.ElementNode && n.Data == "a" {
		for _, a := range n.Attr {
			if a.Key == "href" {
				links = append(links, a.Val)
			}
		}
	}
	if c := n.FirstChild; c != nil {
		c = c.NextSibling
		links = visit(links, c)
	}
	return links
}

I'm using Go version 1.13.8, and my operating system is Ubuntu 20.04 LTS.
When I print doc, I get this output:

&{<nil> 0xc0000ca070 0xc0000ca0e0 <nil> <nil> 2    []}

Answer 1

Score: 2

Your parsed document isn't nil; otherwise you'd see only <nil> printed, not something like &{...}.
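To make the difference concrete, here is a minimal sketch of how fmt prints a nil pointer versus a non-nil pointer to a struct. The node type and the render helper are hypothetical stand-ins for html.Node (reduced to a few fields so the example runs without the x/net module); they are not part of the html package.

```go
package main

import "fmt"

// node is a hypothetical, reduced stand-in for html.Node.
type node struct {
	Parent     *node
	FirstChild *node
	Type       int
	Data       string
}

// render formats n the way fmt.Println would, minus the trailing newline.
func render(n *node) string {
	return fmt.Sprint(n)
}

func main() {
	var missing *node      // a nil pointer
	root := &node{Type: 2} // a non-nil pointer whose pointer fields are nil

	fmt.Println(render(missing)) // <nil>
	fmt.Println(render(root))    // &{<nil> <nil> 2 }
}
```

Read this way, the <nil> entries in the question's output are just nil fields of a valid node (for the root, no parent and no siblings), the hexadecimal values are non-nil child pointers, and the 2 is the node type, which in golang.org/x/net/html corresponds to html.DocumentNode.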

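Visiting all children requires a loop, yet you only check whether the n node has a first child, and if it does, you don't even visit it but jump straight to its next sibling. As a result, at most one child per node is traversed, and if that sibling is nil, the recursive call dereferences a nil pointer.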

To visit all children, use a loop like this:

for c := n.FirstChild; c != nil; c = c.NextSibling {
	links = visit(links, c)
}
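The effect of the loop above can be seen on a small tree. The following is a sketch only: node, visitBuggy, visitAll, and sampleTree are hypothetical names, and node is a stand-in that keeps just the FirstChild and NextSibling links of html.Node. The buggy variant also gets a nil check that the question's code lacks, purely so the comparison runs to completion.

```go
package main

import "fmt"

// node is a hypothetical stand-in for html.Node with only the traversal links.
type node struct {
	Data        string
	FirstChild  *node
	NextSibling *node
}

// visitBuggy mirrors the question's traversal: it skips the first child and
// follows a single sibling. The nil check is added here only to avoid a panic.
func visitBuggy(seen []string, n *node) []string {
	if n == nil {
		return seen
	}
	seen = append(seen, n.Data)
	if c := n.FirstChild; c != nil {
		c = c.NextSibling
		seen = visitBuggy(seen, c)
	}
	return seen
}

// visitAll recurses into every child by walking the sibling chain.
func visitAll(seen []string, n *node) []string {
	seen = append(seen, n.Data)
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		seen = visitAll(seen, c)
	}
	return seen
}

// sampleTree builds root -> (a, b -> (a2)).
func sampleTree() *node {
	a2 := &node{Data: "a2"}
	b := &node{Data: "b", FirstChild: a2}
	a := &node{Data: "a", NextSibling: b}
	return &node{Data: "root", FirstChild: a}
}

func main() {
	fmt.Println(visitBuggy(nil, sampleTree())) // [root b]
	fmt.Println(visitAll(nil, sampleTree()))   // [root a b a2]
}
```

The single-step version never reaches a or a2, which is the same reason links nested under a skipped child would go missing in the original program.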

Testing it:

s := `<a href="http://first.com">first</a><b><a href="http://second.com">second</a></b>`
doc, err := html.Parse(strings.NewReader(s))
if err != nil {
	fmt.Fprintf(os.Stderr, "findlinks1: %v\n", err)
	os.Exit(1)
}
fmt.Println(doc)
for _, link := range visit(nil, doc) {
	fmt.Println("link is", link)
}

Which outputs (try it on the Go Playground):

&{<nil> 0xc00012e070 0xc00012e070 <nil> <nil> 2    []}
link is http://first.com
link is http://second.com

Posted by huangapple on May 28, 2021, 14:40:37
Link to this article: https://go.coder-hub.com/67733994.html