Question

I'm facing an issue where even just 200 requests cause the program to eat up 6 GB of container memory and eventually be killed by the OOM killer.
The idea is that I extract all text nodes present in the HTML and then process them to extract the tag name, the HTML of that tag, and its text. To generate the HTML of a particular tag I use the Render function from golang.org/x/net/html, passing a strings.Builder as the io.Writer that receives the generated HTML. For some reason the builder eats up too much memory.

package main

import (
	"encoding/csv"
	"io"
	"log"
	"net/http"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/data", GetData)
	if err := http.ListenAndServe(":8001", mux); err != nil {
		log.Println(err)
	}
}

type TagInfo struct {
	Tag  string
	Name string
	Text string
}

// http.handler
func GetData(w http.ResponseWriter, r *http.Request) {
	u := r.URL.Query().Get("url")
	doc, err := GetDoc(u)
	if err != nil {
		log.Println(err)
		w.WriteHeader(500)
		return
	}
	var buf strings.Builder
	data := Extract(doc, &buf)
	csvw := csv.NewWriter(io.Discard)
	for _, d := range data {
		csvw.Write([]string{d.Name, d.Tag, d.Text})
	}
}

// GetDoc fires the request and parses the response body as HTML
func GetDoc(u string) (*html.Node, error) {
	res, err := http.Get(u)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	return html.Parse(res.Body)
}

func Extract(doc *html.Node, buf *strings.Builder) []TagInfo {
	var (
		tags = make([]TagInfo, 0, 100)
		f    func(*html.Node)
	)

	f = func(n *html.Node) {
		if n.Type == html.TextNode {
			text := strings.TrimSpace(n.Data)
			if text != "" {
				parent := n.Parent
				tag := Render(parent, buf)
				tagInfo := TagInfo{
					Tag:  tag,
					Name: parent.Data,
					Text: n.Data,
				}
				tags = append(tags, tagInfo)
			}
		}
		for child := n.FirstChild; child != nil; child = child.NextSibling {
			f(child)
		}
	}
	f(doc)
	return tags
}

// Render returns the HTML of the given node.
// If the node of interest is a text node, pass
// its parent node as the parameter.
func Render(n *html.Node, buf *strings.Builder) string {
	defer buf.Reset()
	if err := html.Render(buf, n); err != nil {
		log.Println(err)
		return ""
	}
	return buf.String()
}

If you want the particular URL list, here it is. I fired around 60 requests at a time.

I tried bytes.Buffer, and a sync.Pool of bytes.Buffer, but both have the same issue. Using pprof, I noticed that strings.Builder's WriteString method accounts for the huge memory use.

Answer 1

Score: 2

The basic issue here was accepting any Content-Type, which is not acceptable for scraping: most sites are expected to send text/html.

The problem is that even if the URL returns content that does not represent HTML, golang.org/x/net/html still accepts it without returning an error.

Take an example where application/pdf is returned: the body contains the PDF's binary data, which html.Parse parses without returning any error. That is odd behaviour for a library made for scraping/crawling.

**The solution is:** check the response header and proceed only if the data is HTML. Otherwise there will be ambiguity, or higher memory usage (maybe lower), but we can't predict what will happen.
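A minimal sketch of that header check, using only the standard library. The names `isHTML` and `fetchHTML` are hypothetical, and treating `application/xhtml+xml` as HTML is an assumption rather than something stated above. The `httptest` server in `main` stands in for a site that answers with a PDF, so the example runs without network access.

```go
package main

import (
	"fmt"
	"io"
	"mime"
	"net/http"
	"net/http/httptest"
)

// isHTML reports whether a Content-Type header value denotes an HTML
// document. Parameters such as charset are ignored.
func isHTML(contentType string) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false // missing or malformed header: treat as non-HTML
	}
	// Assumption: XHTML is accepted alongside text/html.
	return mediaType == "text/html" || mediaType == "application/xhtml+xml"
}

// fetchHTML (hypothetical) returns the body only when the server
// declares it as HTML; anything else is rejected before it is read.
func fetchHTML(u string) ([]byte, error) {
	res, err := http.Get(u)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()

	if ct := res.Header.Get("Content-Type"); !isHTML(ct) {
		return nil, fmt.Errorf("skipping non-HTML content type %q", ct)
	}
	return io.ReadAll(res.Body)
}

func main() {
	// Local server standing in for a URL that returns a PDF.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/pdf")
		w.Write([]byte("%PDF-1.7"))
	}))
	defer srv.Close()

	_, err := fetchHTML(srv.URL)
	fmt.Println(err)
}
```

In the question's GetDoc, the same check could sit between `http.Get` and `html.Parse(res.Body)`, so that binary bodies never reach the parser.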

