May 8, 2015 02:34:04 · go

Golang parse HTML, extract all content with <body> </body> tags

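Question

As the title states, I need to return all of the content within the body tags of an HTML document, including any nested HTML tags. What is the best way to go about this? I had a working solution with the Gokogiri package, but I am trying to stay away from packages that depend on C libraries. Is there a way to accomplish this with the Go standard library, or with a package that is 100% Go?

Since posting my original question I have tried the packages listed below without success. Neither seems to return child or nested tags from inside the body. For example, this document: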

<!DOCTYPE html>
<html>
    <head>
        <title>
            Title of the document
        </title>
    </head>
    <body>
        body content 
        <p>more content</p>
    </body>
</html>

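returns only body content, ignoring the <p> tag that follows and the text it wraps. The packages I tried:

pkg/encoding/xml/ (standard library xml package)
golang.org/x/net/html

The overall goal is to obtain a string that looks like: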

<body>
    body content 
    <p>more content</p>
</body>

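Answer 1

Score: 64

This can be solved with the html package by recursively searching for the body node and then rendering the HTML starting from that node.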

package main

import (
	"bytes"
	"errors"
	"fmt"
	"log"
	"strings"

	"golang.org/x/net/html"
)

// Body walks the node tree depth-first and returns the first <body> element.
func Body(doc *html.Node) (*html.Node, error) {
	var body *html.Node
	var crawler func(*html.Node)
	crawler = func(node *html.Node) {
		if node.Type == html.ElementNode && node.Data == "body" {
			body = node
			return
		}
		// Stop visiting siblings once the body has been found.
		for child := node.FirstChild; child != nil && body == nil; child = child.NextSibling {
			crawler(child)
		}
	}
	crawler(doc)
	if body != nil {
		return body, nil
	}
	return nil, errors.New("missing <body> in the node tree")
}

// renderNode serializes n, including its own tags and all of its descendants.
func renderNode(n *html.Node) (string, error) {
	var buf bytes.Buffer
	if err := html.Render(&buf, n); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	doc, err := html.Parse(strings.NewReader(htm))
	if err != nil {
		log.Fatal(err)
	}
	bn, err := Body(doc)
	if err != nil {
		log.Fatal(err)
	}
	body, err := renderNode(bn)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(body)
}

const htm = `<!DOCTYPE html>
<html>
<head>
	<title></title>
</head>
<body>
	body content
	<p>more content</p>
</body>
</html>`

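Answer 2

Score: 11

It can be done with the standard encoding/xml package, although it is a bit cumbersome. One caveat in this example: the output does not include the enclosing body tag, but it does contain all of its children.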

package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

type html struct {
	Body body `xml:"body"`
}
type body struct {
	Content string `xml:",innerxml"`
}

func main() {
	b := []byte(`<!DOCTYPE html>
<html>
    <head>
        <title>
            Title of the document
        </title>
    </head>
    <body>
        body content 
        <p>more content</p>
    </body>
</html>`)

	h := html{}
	err := xml.NewDecoder(bytes.NewBuffer(b)).Decode(&h)
	if err != nil {
		fmt.Println("error", err)
		return
	}

	fmt.Println(h.Body.Content)
}

Runnable example:
http://play.golang.org/p/ZH5iKyjRQp
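
The caveat about the missing body tag can be worked around by re-wrapping the inner XML. The following is only a sketch under stated assumptions: bodyWithTags and the page type are hypothetical names, not part of the answer, and the re-wrapped tag drops any attributes the original body tag carried. It also shows a second limitation: the decoder is strict by default, so HTML that is not well-formed XML (an unclosed <br>, for instance) makes it return an error.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// page mirrors the structs in the answer: only <body> is mapped,
// and its raw inner XML is captured as a string.
type page struct {
	Body struct {
		Inner string `xml:",innerxml"`
	} `xml:"body"`
}

// bodyWithTags (hypothetical helper) decodes src as XML and returns the
// body's inner XML wrapped in a plain <body>...</body> pair.
// Attributes on the original <body> tag are not preserved.
func bodyWithTags(src string) (string, error) {
	var p page
	if err := xml.NewDecoder(strings.NewReader(src)).Decode(&p); err != nil {
		return "", err
	}
	return "<body>" + p.Body.Inner + "</body>", nil
}

func main() {
	wellFormed := `<!DOCTYPE html>
<html>
    <head><title>Title of the document</title></head>
    <body>
        body content
        <p>more content</p>
    </body>
</html>`
	out, err := bodyWithTags(wellFormed)
	fmt.Println(out, err)

	// Valid HTML, but not well-formed XML: the strict decoder rejects it.
	_, err = bodyWithTags(`<html><body>line one<br>line two</body></html>`)
	fmt.Println("unclosed <br> rejected:", err != nil)
}
```

For real-world HTML, which is often not well-formed XML, a dedicated HTML parser is likely the safer choice.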

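Answer 3

Score: 7

Since you didn't show the source code of your attempt with the html package, I have to guess what you were doing, but I suspect you were using the tokenizer rather than the parser. Here is a program that uses the parser and does what you were looking for: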

package main

import (
	"log"
	"os"
	"strings"

	"github.com/andybalholm/cascadia"
	"golang.org/x/net/html"
)

func main() {
	r := strings.NewReader(`<!DOCTYPE html>
<html>
	<head>
		<title>
			Title of the document
		</title>
	</head>
	<body>
		body content 
		<p>more content</p>
	</body>
</html>`)
	doc, err := html.Parse(r)
	if err != nil {
		log.Fatal(err)
	}

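	body := cascadia.MustCompile("body").MatchFirst(doc)
	// Render writes the body element, tags included, and can fail on write errors.
	if err := html.Render(os.Stdout, body); err != nil {
		log.Fatal(err)
	}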
}

Answer 4

Score: 2

You could also do this purely with strings:

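package main

import (
	"fmt"
	"io/ioutil"
	"strings"
)

// NewSkipTillReader and NewReadTillReader are not defined in this snippet;
// their definitions are in the playground link given after the code.
func main() {
    r := strings.NewReader(`
<!DOCTYPE html>
<html>
    <head>
        <title>
            Title of the document
        </title>
    </head>
    <body>
        body content
        <p>more content</p>
    </body>
</html>
`)
    str := NewSkipTillReader(r, []byte("<body>"))
    rtr := NewReadTillReader(str, []byte("</body>"))
    bs, err := ioutil.ReadAll(rtr)
    fmt.Println(string(bs), err)
}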

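SkipTillReader and ReadTillReader are defined here: https://play.golang.org/p/6THLhRgLOa. Basically, the first skips input until it sees its delimiter, and the second reads until it sees its delimiter.

The matching is case-sensitive, so it will not find <BODY> (though that would not be hard to change).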
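
As a rough illustration of the case-insensitive variant, here is a self-contained sketch that uses only the standard library. It does not reproduce the playground readers: extractBody and asciiLower are hypothetical names, and the approach is plain substring search, so it can be fooled by the text "<body" appearing in comments, scripts, or longer tag names.

```go
package main

import (
	"fmt"
	"strings"
)

// asciiLower lowercases only A-Z, so byte offsets in the result line up
// exactly with byte offsets in the input (unlike a full Unicode lowercase).
func asciiLower(s string) string {
	b := []byte(s)
	for i, c := range b {
		if c >= 'A' && c <= 'Z' {
			b[i] = c + ('a' - 'A')
		}
	}
	return string(b)
}

// extractBody (hypothetical helper) returns the text from the first "<body"
// through the last "</body>", matching the tag name in any letter case.
func extractBody(src string) (string, bool) {
	lower := asciiLower(src)
	start := strings.Index(lower, "<body")
	if start < 0 {
		return "", false
	}
	end := strings.LastIndex(lower, "</body>")
	if end < start {
		return "", false
	}
	// Slice the original string so the caller sees the original casing.
	return src[start : end+len("</body>")], true
}

func main() {
	out, ok := extractBody(`<html><BODY class="x">hi <p>there</p></BODY></html>`)
	fmt.Println(ok, out)
	// prints: true <BODY class="x">hi <p>there</p></BODY>
}
```

Searching a lowercased copy while slicing the original keeps the output byte-for-byte identical to the input, which a streaming reader would need extra buffering to achieve.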
