How to get the contents of an HTML element
Question
I'm quite new to Go and I'm struggling with parsing some HTML.
The HTML looks like:
<!DOCTYPE html>
<html>
    <head>
        <title></title>
    </head>
    <body>
        <div>something</div>
        <div id="publication">
            <div>I want <span>this</span></div>
        </div>
        <div>
            <div>not this</div>
        </div>
    </body>
</html>
And I want to get this as a string:
<div>I want <span>this</span></div>
I've tried html.NewTokenizer() (from golang.org/x/net/html), but I can't seem to get the entire contents of an element back from a token or node. I also tried tracking depth with it, but that picked up other parts of the document.
I've also tried goquery, which seems perfect. My code:
doc, err := goquery.NewDocument("{url}")
if err != nil {
    log.Fatal(err)
}
doc.Find("#publication").Each(func(i int, s *goquery.Selection) {
    fmt.Printf("Review %d: %s\n", i, s.Html())
})
But s.Text() will only print out the text and s.Html() doesn't seem to exist (?).
I think parsing it as XML would work, except the actual HTML is very deep and there would have to be a struct for each parent element...
Any help would be amazing!
Answer 1
Score: 2
s.Html() does exist. Your code fails because s.Html() returns two values, the HTML string and an error, so it can't be passed directly as an argument to fmt.Printf.
Assign both return values first, then print the string:
doc.Find("#publication").Each(func(i int, s *goquery.Selection) {
    // Html returns the inner HTML of the first matched element, plus an error.
    insideHTML, err := s.Html()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("Review %d: %s\n", i, insideHTML)
})
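On the XML idea from the question: you may not need a struct per parent element. Below is a minimal sketch using only the standard library's encoding/xml. It streams tokens until it finds the element with the wanted id, then captures its raw contents with an `,innerxml` field. The helper name innerXMLByID and the sampleHTML constant are made up for illustration, and this only holds up when the markup is close to well-formed XML; for real-world HTML, goquery or golang.org/x/net/html is the safer choice.

```go
package main

import (
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"log"
	"strings"
)

// sampleHTML is the document from the question.
const sampleHTML = `<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<div>something</div>
<div id="publication">
<div>I want <span>this</span></div>
</div>
<div>
<div>not this</div>
</div>
</body>
</html>`

// innerXMLByID is a hypothetical helper (not part of any library). It returns
// the raw markup inside the first element whose id attribute equals id.
func innerXMLByID(doc, id string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	// Loosen the XML parser for HTML-ish input. This is still not an HTML parser.
	dec.Strict = false
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return "", errors.New("no element with id " + id)
		}
		if err != nil {
			return "", err
		}
		start, ok := tok.(xml.StartElement)
		if !ok {
			continue // skip text, comments, the doctype, and end tags
		}
		for _, attr := range start.Attr {
			if attr.Name.Local == "id" && attr.Value == id {
				// One small struct is enough: innerxml keeps the contents verbatim.
				var el struct {
					Inner string `xml:",innerxml"`
				}
				if err := dec.DecodeElement(&el, &start); err != nil {
					return "", err
				}
				return strings.TrimSpace(el.Inner), nil
			}
		}
	}
}

func main() {
	inner, err := innerXMLByID(sampleHTML, "publication")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(inner)
}
```

For the sample document this should print <div>I want <span>this</span></div>. The surrounding whitespace is trimmed because innerxml returns the raw bytes between the opening and closing tags, newlines included.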