HTML - find all the sub-tags in a given tag
Question
Assume I have an HTML page that contains something like this:
    <ul class="good">
        <li>1</li>
        <li>2</li>
        <li>3</li>
    </ul>
    <ul class="bad">
        <li>a</li>
        <li>b</li>
        <li>c</li>
    </ul>
I want to grab the <li> elements inside the first <ul>. From here I have basically copied the following code (note: edited per @twotwotwo's comment):
    page, _ := html.Parse(httpBody)
    var f func(*html.Node)
    f = func(n *html.Node) {
        //fmt.Println("Inside f")
        if n.Type == html.ElementNode && n.Data == "ul" {
            fmt.Println("ul found -> ", n)
            for c := n.FirstChild; c != nil; c = c.NextSibling {
                f(c)
            }
        } else {
            fmt.Println(n.Data, "is not the correct one")
            for c := n.FirstChild; c != nil; c = c.NextSibling {
                f(c)
            }
        }
    }
    f(page)
But the only output I get is:
is not the correct one
html is not the correct one
head is not the correct one
body is not the correct one
I wonder why the recursion stops at body. I have tried it with motherfuckingwebsite.com, which has tags inside the body.
P.S.
I have also tried:
    page := html.NewTokenizer(httpBody)
    for {
        tokenType := page.Next()
        if tokenType == html.ErrorToken {
            return links
        }
        token := page.Token()
        // ...
    }
but this seems to show all the tokens, without regard for the tree structure.
Answer 1
Score: 4
I have used this package in the past: https://github.com/PuerkitoBio/goquery
It provides a jQuery-like interface for querying HTML documents. With that library, it's as simple as this:
    package main

    import (
        "bytes"
        "fmt"
        "log"

        "github.com/PuerkitoBio/goquery"
    )

    // httpBody stands in for the page from the question.
    var httpBody string = `
    <ul class="good">
        <li>1</li>
        <li>2</li>
        <li>3</li>
    </ul>
    <ul class="bad">
        <li>a</li>
        <li>b</li>
        <li>c</li>
    </ul>
    `

    func main() {
        b := bytes.NewBufferString(httpBody)
        doc, err := goquery.NewDocumentFromReader(b)
        if err != nil {
            log.Fatal(err)
        }
        // Select the <ul> with class "good", then print each <li> inside it.
        doc.Find("ul.good").Each(func(i int, ul *goquery.Selection) {
            ul.Find("li").Each(func(i int, li *goquery.Selection) {
                fmt.Println(li.Text())
            })
        })
    }
Which prints:
1
2
3