Memory leak with HTML render function

Question

I'm facing an issue where even just 200 requests cause the program to eat up 6 GB of the container's memory and eventually get killed by the OOM killer.
The idea is that I'm extracting all text nodes present in the HTML and then processing them to extract the tag name, the HTML of that tag, and the text. To generate the HTML of a particular tag I'm using the Render function from golang.org/x/net/html, to which I provide a strings.Builder as the io.Writer to write the generated HTML. But for some reason the builder eats up too much memory.

package main

import (
	"encoding/csv"
	"io"
	"log"
	"net/http"
	"strings"
	"golang.org/x/net/html"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/data", GetData)
	if err := http.ListenAndServe(":8001", mux); err != nil {
		log.Println(err)
	}
}

type TagInfo struct {
	Tag  string
	Name string
	Text string
}

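// GetData is the HTTP handler: it fetches the page named by the
// "url" query parameter and extracts tag info from it.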
func GetData(w http.ResponseWriter, r *http.Request) {
	u := r.URL.Query().Get("url")
	doc, err := GetDoc(u)
	if err != nil {
		log.Println(err)
		w.WriteHeader(500)
		return
	}
	var buf strings.Builder
	data := Extract(doc, &buf)
	csvw := csv.NewWriter(io.Discard)
	for _, d := range data {
		csvw.Write([]string{d.Name, d.Tag, d.Text})
	}
}

// GetDoc fires the request and parses the response body as HTML.
func GetDoc(u string) (*html.Node, error) {
	res, err := http.Get(u)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	return html.Parse(res.Body)
}

func Extract(doc *html.Node, buf *strings.Builder) []TagInfo {
	var (
		tags = make([]TagInfo, 0, 100)
		f    func(*html.Node)
	)

	f = func(n *html.Node) {
		if n.Type == html.TextNode {
			text := strings.TrimSpace(n.Data)
			if text != "" {
				parent := n.Parent
				tag := Render(parent, buf)
				tagInfo := TagInfo{
					Tag:  tag,
					Name: parent.Data,
					Text: n.Data,
				}
				tags = append(tags, tagInfo)
			}
		}
		for child := n.FirstChild; child != nil; child = child.NextSibling {
			f(child)
		}
	}
	f(doc)
	return tags
}

// Render renders the HTML around the tag.
// If the node is a text node, pass its
// parent node as the parameter.
func Render(n *html.Node, buf *strings.Builder) string {
	defer buf.Reset()
	if err := html.Render(buf, n); err != nil {
		log.Println(err)
		return ""
	}
	return buf.String()
}

If you want the particular URL list, here it is. I fired around 60 requests at a time.

I tried bytes.Buffer, and a sync.Pool of bytes.Buffer, but both have the same issue. Using pprof I noticed that strings.Builder's WriteString method is causing the huge memory use.

Answer 1

Score: 2

The basic issue here was accepting any content type, which is not acceptable for scraping: most sites are expected to send text/html.

The problem is that even if the URL returns content that does not represent HTML, golang.org/x/net/html still accepts it without returning an error.

Take an example where application/pdf is returned: the body contains the binary data of the PDF, which html.Parse parses without returning any error. That is odd behaviour for a library made for scraping/crawling.

The solution is: check the response header and proceed only if the data is HTML. Otherwise there will be ambiguity or higher memory usage (maybe lower), but we can't predict what will happen.

huangapple
  • Posted on June 28, 2023, 05:27:59
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76568739.html