Convert an XPath node back to HTML markup in Go

Question

import (
    "fmt"
    "gopkg.in/xmlpath.v2"
    "log"
)

...

path := xmlpath.MustCompile("//div[@id='23']")
tree, err := xmlpath.ParseHTML(reader)
if err != nil {
    log.Fatal("HTML parsing error, maybe not well-formed: ", err)
}

iter := path.Iter(tree)
for iter.Next() {
    fmt.Println(iter.Node().String()) // returns only the values of the inner text nodes
}

...

Is there a way to convert `iter.Node()` back to HTML markup like `<div>...</div>`? `iter.Node().String()` returns only the values of all inner text nodes. As far as I can see, the documentation of the [xmlpath package][1] does not offer such a function.

[1]: https://godoc.org/gopkg.in/xmlpath.v2

Answer 1

Score: 0

You are right: the functions in gopkg.in/xmlpath.v2 are limited to reading the content of nodes, and there are not many alternatives in Go for working with the DOM.

Of the native Go libraries, I can only mention goquery. It works only with HTML and does not support XPath, but it does support CSS selectors. Maybe that would be enough in your case.

If you really need to work with both HTML and XML via XPath, there is a libxml wrapper for Go called gokogiri. It supports all the features of libxml, so you can get nodes, inner/outer HTML, attributes and other things. I used it to extract text content in a service that is currently in production. It's a bit faster than PHP's DOMDocument. The one caveat is that I'm not sure whether it supports Go versions higher than 1.4.*. Oh, and installation on Windows is a bit tricky.

Answer 2

Score: 0

I know this answer is late, but I still recommend these packages written in native Go: xquery and xpath. They support extracting data or evaluating values from XML/HTML using XPath expressions.

Posted by huangapple on April 8, 2016 at 22:27:06
Source: https://go.coder-hub.com/36502174.html