How can I get the content of an html.Node?

Question

I would like to get data from a URL using the third-party Go library at http://godoc.org/code.google.com/p/go.net/html, but I ran into a problem: I can't get the content of an html.Node.

The reference documentation includes the following example:

s := `<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>`
doc, err := html.Parse(strings.NewReader(s))
if err != nil {
	log.Fatal(err)
}
// f walks the tree depth-first and prints the href of every <a> element.
var f func(*html.Node)
f = func(n *html.Node) {
	if n.Type == html.ElementNode && n.Data == "a" {
		for _, a := range n.Attr {
			if a.Key == "href" {
				fmt.Println(a.Val)
				break
			}
		}
	}
	// Visit every child, whether or not n itself matched.
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		f(c)
	}
}
f(doc)

The output is:

foo
/bar/baz

If I want to get the link text instead:

Foo
BarBaz

what should I do?

Answer 1

Score: 10

The tree of <a href="link"><strong>Foo</strong>Bar</a> looks basically like this:

  • ElementNode "a" (this node also includes a list of attributes)
    • ElementNode "strong"
      • TextNode "Foo"
    • TextNode "Bar"
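
To make that structure concrete, here is a minimal sketch that builds the same tree by hand and prints it with indentation. The NodeType and Node types, and the dump, dumpString and sampleTree helpers, are hypothetical stand-ins written for this illustration; they copy only a few field names from html.Node and are not the real package types.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeType and Node are hypothetical stand-ins that mirror only the
// fields of html.Node needed here; they are not the real package types.
type NodeType int

const (
	TextNode NodeType = iota
	ElementNode
)

type Node struct {
	Type                    NodeType
	Data                    string
	FirstChild, NextSibling *Node
}

// dump writes the subtree rooted at n, one node per line, indented by depth.
func dump(n *Node, depth int, sb *strings.Builder) {
	kind := "TextNode"
	if n.Type == ElementNode {
		kind = "ElementNode"
	}
	fmt.Fprintf(sb, "%s%s %q\n", strings.Repeat("  ", depth), kind, n.Data)
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		dump(c, depth+1, sb)
	}
}

// dumpString returns the indented dump of n as a string.
func dumpString(n *Node) string {
	var sb strings.Builder
	dump(n, 0, &sb)
	return sb.String()
}

// sampleTree builds <a><strong>Foo</strong>Bar</a> by hand (attributes omitted).
func sampleTree() *Node {
	bar := &Node{Type: TextNode, Data: "Bar"}
	foo := &Node{Type: TextNode, Data: "Foo"}
	strong := &Node{Type: ElementNode, Data: "strong", FirstChild: foo, NextSibling: bar}
	return &Node{Type: ElementNode, Data: "a", FirstChild: strong}
}

func main() {
	fmt.Print(dumpString(sampleTree()))
}
```

The point to notice is that "Foo" is not a direct child of the "a" element, so reading only the first child of "a" would miss it.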

So, assuming that you want the plain text of the link (e.g. FooBar), you have to walk through the tree and collect all of the text nodes. For example:

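// collectText appends the text of every TextNode under n to buf, in document order.
func collectText(n *html.Node, buf *bytes.Buffer) {
	if n.Type == html.TextNode {
		buf.WriteString(n.Data)
	}
	// Recurse into the children so that text inside nested elements is included.
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		collectText(c, buf)
	}
}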

And the corresponding change to your function:

var f func(*html.Node)
f = func(n *html.Node) {
	if n.Type == html.ElementNode && n.Data == "a" {
		// Gather all text below this link instead of printing its href.
		text := &bytes.Buffer{}
		collectText(n, text)
		fmt.Println(text)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		f(c)
	}
}
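
For experimenting with the two functions without installing the parser, the following is a rough, standard-library-only sketch of the same traversal. It assumes a hand-built tree in place of the output of html.Parse: the Node and NodeType types and the text, elem, linkTexts and sampleDoc helpers are hypothetical names introduced for this illustration, and strings.Builder is swapped in for bytes.Buffer. Only collectText follows the answer's logic directly.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeType and Node are hypothetical stand-ins for the parser's node
// type; only the fields used by the traversal are reproduced.
type NodeType int

const (
	TextNode NodeType = iota
	ElementNode
)

type Node struct {
	Type                    NodeType
	Data                    string
	FirstChild, NextSibling *Node
}

// text and elem are small constructors for building a tree by hand.
func text(s string) *Node { return &Node{Type: TextNode, Data: s} }

func elem(name string, children ...*Node) *Node {
	n := &Node{Type: ElementNode, Data: name}
	// Link the children back to front so FirstChild ends up as children[0].
	for i := len(children) - 1; i >= 0; i-- {
		children[i].NextSibling = n.FirstChild
		n.FirstChild = children[i]
	}
	return n
}

// collectText appends the Data of every text node under n, in document order.
func collectText(n *Node, sb *strings.Builder) {
	if n.Type == TextNode {
		sb.WriteString(n.Data)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		collectText(c, sb)
	}
}

// linkTexts returns the collected text of each "a" element under root.
func linkTexts(root *Node) []string {
	var out []string
	var walk func(*Node)
	walk = func(n *Node) {
		if n.Type == ElementNode && n.Data == "a" {
			var sb strings.Builder
			collectText(n, &sb)
			out = append(out, sb.String())
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(root)
	return out
}

// sampleDoc approximates the tree for the markup in the question.
func sampleDoc() *Node {
	return elem("body",
		elem("p", text("Links:")),
		elem("ul",
			elem("li", elem("a", text("Foo"))),
			elem("li", elem("a", text("BarBaz"))),
		),
	)
}

func main() {
	for _, s := range linkTexts(sampleDoc()) {
		fmt.Println(s)
	}
}
```

Returning a slice rather than printing inside the walk is a design choice made here to keep the result easy to check; with the real package, the same recursion should apply to *html.Node since the field names used above match.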

huangapple
  • Published on August 16, 2013, 21:29:01
  • When reposting, please keep the link to this article: https://go.coder-hub.com/18274501.html