HTML – find all the sub-tags in a given tag

Question

Assume I have an HTML page that contains something like:

<ul class="good">
    <li>1</li>
    <li>2</li>
    <li>3</li>
</ul>

<ul class="bad">
    <li>a</li>
    <li>b</li>
    <li>c</li>
</ul>

I want to grab the <li> elements inside the first <ul>. I basically copied the code below from an existing example (note: code edited per @twotwotwo's comment):

page, _ := html.Parse(httpBody)
var f func(*html.Node)
f = func(n *html.Node) {
	//fmt.Println("Inside f")
	if n.Type == html.ElementNode && n.Data == "ul" {
		fmt.Println("ul found -> ", n)
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			f(c)
		}
	} else {
		fmt.Println(n.Data, "is not the correct one")
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			f(c)
		}
	}
}
f(page)

But the only output I get is:

 is not the correct one
html is not the correct one
head is not the correct one
body is not the correct one

I wonder why the recursion stops at body. I have tried it with motherfuckingwebsite.com, which has tags inside the body.

P.S.
I have also tried:

page := html.NewTokenizer(httpBody)

for {
    tokenType := page.Next()
    if tokenType == html.ErrorToken {
        return links
    }
    token := page.Token()

but this seems to show all the tokens, without regard to the tree structure.

EDIT:

Answer 1

Score: 4

I have, in the past, used this package: https://github.com/PuerkitoBio/goquery

It provides a "jQuery-like" interface for querying HTML documents. With that library, it's as simple as this:

package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/PuerkitoBio/goquery"
)

var httpBody string = `
	<ul class="good">
	    <li>1</li>
	    <li>2</li>
	    <li>3</li>
	</ul>

	<ul class="bad">
	    <li>a</li>
	    <li>b</li>
	    <li>c</li>
	</ul>
`

func main() {
	b := bytes.NewBufferString(httpBody)
	doc, err := goquery.NewDocumentFromReader(b)
	if err != nil {
		log.Fatal(err)
	}

	// Select only the list with class "good", then print each of its items.
	doc.Find("ul.good").Each(func(i int, ul *goquery.Selection) {
		ul.Find("li").Each(func(i int, li *goquery.Selection) {
			fmt.Println(li.Text())
		})
	})
}

Which prints:

1
2
3

huangapple
  • Posted on October 1, 2014 at 10:50:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/26133381.html