How to get the contents of an HTML element


Question

I'm quite new to Go and I'm struggling a little with parsing some HTML.

The HTML looks like this:

<!DOCTYPE html>
<html>
<head>
	<title></title>
</head>
<body>

	<div>something</div>

	<div id="publication">
		<div>I want <span>this</span></div>
	</div>

	<div>
		<div>not this</div>
	</div>

</body>
</html>

I want to get the following as a string:

<div>I want <span>this</span></div>

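I've tried html.NewTokenizer() (from golang.org/x/net/html), but I can't seem to get the entire contents of an element back from a token or node. I also tried tracking depth with it, but that picked up other bits of markup.

I've also had a go with goquery, which seems perfect. The code: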

doc, err := goquery.NewDocument("{url}")
if err != nil {
    log.Fatal(err)
}

doc.Find("#publication").Each(func(i int, s *goquery.Selection) {
    fmt.Printf("Review %d: %s\n", i, s.Html())
})

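But s.Text() only prints the text, and s.Html() doesn't seem to exist (?).

I think parsing it as XML would work, except that the actual HTML is deeply nested and there would have to be a struct for each parent element...

Any help would be amazing!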


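Answer 1

Score: 2

s.Html() does exist. Your code fails because s.Html() returns two values, the HTML string and an error, so its result cannot be passed directly as one of several arguments to fmt.Printf. Assign both return values to variables first.

Change your code as follows and it will work: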

doc.Find("#publication").Each(func(i int, s *goquery.Selection) {
    // s.Html() returns (string, error); the underscore discards the error.
    inside_html, _ := s.Html()
    fmt.Printf("Review %d: %s\n", i, inside_html)
})

huangapple
  • Posted on January 5, 2016, 02:44:04
  • When reposting, please keep the link to this article: https://go.coder-hub.com/34597717.html