Using html.ParseFragment in a generic way

Question

Using the experimental code.google.com/p/go.net/html package (since moved to golang.org/x/net/html), we can use ParseFragment to parse a sub-section of an HTML document.

Like this:

var s = `
    <option id="foo">first</option>
    <option Class="tester">second</option>
    <option>third</option>
`
doc, err := html.ParseFragment(strings.NewReader(s), &html.Node{
    Type: html.ElementNode,
    Data: "body",
    DataAtom: atom.Body,
})

This works fine for most elements, but it doesn't seem to work when certain elements sit at the root of the fragment, such as tbody, tr, and td (and perhaps others; I'm not sure). The parser simply ignores those tags and returns only the text content.

This can be remedied by providing the semantically correct parent instead of atom.Body, but that requires that we know in advance what the HTML will be.

I'd hoped there was a generic root like atom.DocumentFragment, but I don't see one. Is there some way to use ParseFragment so that it works with any arbitrary HTML fragment?

Answer 1

Score: 2

ParseFragment is always context-sensitive because it follows the HTML5 fragment-parsing algorithm. That algorithm is designed for implementing the DOM innerHTML property, and the correct tree to generate from a given innerHTML string depends on the surrounding context (especially whether the context is in a table or not).
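
To make the table dependence concrete, here is a minimal Go sketch of the part of the spec's "reset the insertion mode appropriately" step that concerns table-related context elements. The tableModes map and the initialTableMode function are illustrative names invented for this example, not part of the html package, and other contexts (select, template, html and so on) are deliberately left unmodeled.

```go
package main

import "fmt"

// tableModes maps a table-related context element to the insertion mode
// the fragment parser starts in. Only table-related contexts are modeled
// here; this is a simplification of the full algorithm.
var tableModes = map[string]string{
	"table":    "in table",
	"caption":  "in caption",
	"colgroup": "in column group",
	"tbody":    "in table body",
	"thead":    "in table body",
	"tfoot":    "in table body",
	"tr":       "in row",
}

// initialTableMode (hypothetical helper) reports the starting insertion
// mode for a table-related context, or false if the context is not one
// of the elements modeled above.
func initialTableMode(context string) (string, bool) {
	mode, ok := tableModes[context]
	return mode, ok
}

func main() {
	for _, ctx := range []string{"table", "tbody", "tr", "div"} {
		if mode, ok := initialTableMode(ctx); ok {
			fmt.Printf("context %q starts in mode %q\n", ctx, mode)
		} else {
			fmt.Printf("context %q is not modeled by this sketch\n", ctx)
		}
	}
}
```

For a context such as body, the parser instead starts in the "in body" mode, where a stray tbody, tr or td start tag is a parse error and is ignored, which is why only the text content survives.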

So the html package has no way to parse an HTML fragment independently of its context.
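
Because a context always has to be supplied, one possible workaround is to guess it from the fragment itself. The following is a rough sketch, using only the standard library, that picks a context tag name from the first start tag in the fragment. The guessContext function and the parentFor table are hypothetical helpers, not something the html package provides, and the heuristic looks only at the first tag, so it will misjudge fragments that mix table parts with other content or that begin with a comment.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// firstTag matches the first start tag of a fragment, allowing leading
// whitespace. Fragments that begin with text or a comment do not match.
var firstTag = regexp.MustCompile(`^\s*<([a-zA-Z][a-zA-Z0-9]*)`)

// parentFor maps a root-level tag to a parent element that permits it.
// The table is an assumption made for this example and is not exhaustive.
var parentFor = map[string]string{
	"tr":       "tbody",
	"td":       "tr",
	"th":       "tr",
	"tbody":    "table",
	"thead":    "table",
	"tfoot":    "table",
	"caption":  "table",
	"colgroup": "table",
	"col":      "colgroup",
}

// guessContext returns the tag name of a plausible context element for
// the fragment, falling back to "body" when nothing more specific applies.
func guessContext(fragment string) string {
	m := firstTag.FindStringSubmatch(fragment)
	if m == nil {
		return "body"
	}
	if parent, ok := parentFor[strings.ToLower(m[1])]; ok {
		return parent
	}
	return "body"
}

func main() {
	fragments := []string{
		"<tr><td>a</td></tr>",
		"<td>cell</td>",
		"<option>first</option>",
		"plain text",
	}
	for _, f := range fragments {
		fmt.Printf("%q -> context %s\n", f, guessContext(f))
	}
}
```

The returned name could then be used to build the context node passed to ParseFragment, with Data set to the name and DataAtom set to atom.Lookup([]byte(name)).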

If you need more information about how the parsing depends on the context, see http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#reset-the-insertion-mode-appropriately

huangapple
  • Posted on January 29, 2014 at 11:50:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/21421704.html