Using html.ParseFragment in a generic way
Question
Using the experimental code.google.com/p/go.net/html package, we can use ParseFragment to parse some sub-section of an HTML document, like this:
var s = `
<option id="foo">first</option>
<option Class="tester">second</option>
<option>third</option>
`
doc, err := html.ParseFragment(strings.NewReader(s), &html.Node{
	Type:     html.ElementNode,
	Data:     "body",
	DataAtom: atom.Body,
})
This works fine for most elements, but not when certain elements such as tbody, tr, and td (and perhaps others, I'm not sure) are at the root of the fragment. The parser simply ignores those tags and returns only the text content.
This can be remedied by providing the semantically correct parent instead of atom.Body, but that requires knowing in advance what the HTML will be.
I'd hoped there was a generic root like atom.DocumentFragment, but I don't see one. Is there some way to use ParseFragment so that it works with any arbitrary HTML fragment?
Answer 1
Score: 2
ParseFragment is always context-sensitive because it follows the HTML5 fragment-parsing algorithm. That algorithm is designed for implementing the DOM innerHTML property, and the correct tree to generate from a given innerHTML string depends on the surrounding context (especially whether the context is inside a table).
So the html package has no way to parse an HTML fragment independently of its context.
If you need more information about how the parsing depends on the context, see http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#reset-the-insertion-mode-appropriately