Question

Using the experimental code.google.com/p/go.net/html package, we can use ParseFragment to parse some sub-section of an HTML document.

Like this:

var s = `
    <option id="foo">first</option>
    <option Class="tester">second</option>
    <option>third</option>
`
doc, err := html.ParseFragment(strings.NewReader(s), &html.Node{
    Type: html.ElementNode,
    Data: "body",
    DataAtom: atom.Body,
})

This works fine for most elements. But it doesn't seem to work when certain elements are at the root position of the HTML, like tbody, tr, and td (and perhaps others, not sure). It simply ignores the tags and only gives the text content.

This can be remedied by providing the semantically correct parent instead of atom.Body, but that requires that we know in advance what the HTML will be.

I'd hoped there was a generic root like atom.DocumentFragment, but I don't see that. So is there some way to use this in such a manner that it'll work with any arbitrary HTML fragment?

Answer 1

Score: 2

ParseFragment is always context-sensitive because it follows the HTML5 fragment-parsing algorithm. That algorithm is designed for implementing the DOM innerHTML property, and the correct tree to generate from a given innerHTML string depends on the surrounding context (especially whether the context is in a table or not).

So the html package has no way to parse an HTML fragment independently of its context.

If you need more information about how the parsing depends on the context, see http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#reset-the-insertion-mode-appropriately
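
Since the context has to come from somewhere, one possible workaround is to guess it from the fragment's first tag before calling ParseFragment. The following is only a sketch under stated assumptions: contextFor and firstTagName are hypothetical helpers, not part of the html package, and the tag-to-parent mapping is a hand-written guess that covers the table-related elements mentioned in the question. Everything else falls back to body.

```go
package main

import (
	"fmt"
	"strings"
)

// contextFor is a hypothetical helper: it guesses a parent element name
// for a fragment by looking at the fragment's first start tag.
// The mapping below is an assumption for illustration only.
func contextFor(fragment string) string {
	switch firstTagName(fragment) {
	case "td", "th":
		return "tr"
	case "tr":
		return "tbody"
	case "tbody", "thead", "tfoot", "caption", "colgroup":
		return "table"
	case "col":
		return "colgroup"
	default:
		return "body"
	}
}

// firstTagName returns the lowercased name of the start tag that opens s
// (ignoring leading whitespace), or "" if s does not open with a start tag.
func firstTagName(s string) string {
	s = strings.TrimSpace(s)
	if !strings.HasPrefix(s, "<") {
		return ""
	}
	s = s[1:]
	end := strings.IndexAny(s, " \t\r\n/>")
	if end < 0 {
		end = len(s)
	}
	return strings.ToLower(s[:end])
}

func main() {
	fragments := []string{
		"<tr><td>cell</td></tr>",
		"<TD>cell</TD>",
		"<option>first</option>",
		"just text",
	}
	for _, frag := range fragments {
		fmt.Printf("%-24s -> %s\n", frag, contextFor(frag))
	}

	// With the html and atom packages imported, the guess could be
	// passed on roughly like this (not executed here):
	//
	//   ctx := contextFor(s)
	//   nodes, err := html.ParseFragment(strings.NewReader(s), &html.Node{
	//       Type:     html.ElementNode,
	//       Data:     ctx,
	//       DataAtom: atom.Lookup([]byte(ctx)),
	//   })
}
```

This only inspects the first tag, so a fragment that mixes elements needing different contexts would still be parsed under a single guessed parent. It does not make ParseFragment context-free; it just moves the guess into one place.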
