Golang Gokogiri recursive xpath anomaly

Question

I am trying to run XPath queries against an HTML document, specifically a two-level query. The HTML document "index.html" is as follows:

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Document</title>
</head>
<body>
	<div class="head">
		<div class="area">
			<div class="value">10</div>
		</div>
		<div class="area">
			<div class="value">20</div>
		</div>
		<div class="area">
			<div class="value">30</div>
		</div>
	</div>
</body>
</html>

I want to first select all divs with class="area", then, within each of them, select the divs with class="value", using Gokogiri in Go.

My Go code is as follows:
package main

import (
	"fmt"
	"io/ioutil"

	"github.com/moovweb/gokogiri"
	"github.com/moovweb/gokogiri/xpath"
)

func main() {
	content, _ := ioutil.ReadFile("index.html")

	doc, _ := gokogiri.ParseHtml(content)
	defer doc.Free()

	xps := xpath.Compile("//div[@class='head']/div[@class='area']")
	xpw := xpath.Compile("//div[@class='value']")
	ss, _ := doc.Root().Search(xps)
	for _, s := range ss {
		ww, _ := s.Search(xpw)
		for _, w := range ww {
			fmt.Println(w.InnerHtml())
		}
	}
}

However, the output I get is odd:

10
20
30
10
20
30
10
20
30

I expected to get:

10
20
30

I want to search for XPath patterns recursively, and I think something is wrong with my second-level XPath pattern. It appears that the second-level query searches the whole document again instead of the individual divs with class="area". How do I make the second query search only within each matched node? I'd appreciate any help.

Answer 1

Score: 6

An XPath search that starts from a node can still search the entire tree: an expression beginning with // is evaluated from the document root, whatever the context node is.

If you want to search just the subtree, start the expression with a . (assuming you still want descendant-or-self); otherwise use an exact path.

xps := xpath.Compile("//div[@class='head']/div[@class='area']")
xpw := xpath.Compile(".//div[@class='value']")

// this works in your example case
// xpw := xpath.Compile("div[@class='value']")
// as does this
// xpw := xpath.Compile("./div[@class='value']")

ss, _ := doc.Root().Search(xps)
for _, s := range ss {
    ww, _ := s.Search(xpw)
    for _, w := range ww {
        fmt.Println(w.InnerHtml())
    }
}

Prints:

10
20
30

Answer 2

Score: 2

Your second query, //div[@class='value'], will select divs anywhere in the document regardless of the parent element. Instead, try div[@class='value'].

Posted by huangapple on August 19, 2014, 23:01:12
When reposting, please keep the link to this article: https://go.coder-hub.com/25386761.html