memory leak with html render function
Question
I'm facing an issue where even trying just 200 requests causes the program to eat up 6 GB of memory for the container and eventually get killed by the OOM killer.
The idea is that I'm extracting all text nodes present in the HTML and then processing them to extract their names, the HTML of that tag, and the text. So, for generating the HTML of particular tags, I'm using the Render function from golang.org/x/net/html, to which I provide a strings.Builder as the io.Writer that the generated HTML is written to. But for some reason the builder eats up too much memory.
package main

import (
	"encoding/csv"
	"io"
	"log"
	"net/http"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/data", GetData)
	if err := http.ListenAndServe(":8001", mux); err != nil {
		log.Println(err)
	}
}

type TagInfo struct {
	Tag  string
	Name string
	Text string
}

// GetData is the http.HandlerFunc for /data.
func GetData(w http.ResponseWriter, r *http.Request) {
	u := r.URL.Query().Get("url")
	doc, err := GetDoc(u)
	if err != nil {
		log.Println(err)
		w.WriteHeader(500)
		return
	}
	var buf strings.Builder
	data := Extract(doc, &buf)
	csvw := csv.NewWriter(io.Discard)
	for _, d := range data {
		csvw.Write([]string{d.Name, d.Tag, d.Text})
	}
	csvw.Flush() // flush any buffered rows
}

// GetDoc fires the request and parses the text/html body.
func GetDoc(u string) (*html.Node, error) {
	res, err := http.Get(u)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	return html.Parse(res.Body)
}

// Extract walks the document and collects every non-empty text node
// together with its parent tag's name and rendered HTML.
func Extract(doc *html.Node, buf *strings.Builder) []TagInfo {
	var (
		tags = make([]TagInfo, 0, 100)
		f    func(*html.Node)
	)
	f = func(n *html.Node) {
		if n.Type == html.TextNode {
			text := strings.TrimSpace(n.Data)
			if text != "" {
				parent := n.Parent
				tag := Render(parent, buf)
				tagInfo := TagInfo{
					Tag:  tag,
					Name: parent.Data,
					Text: n.Data,
				}
				tags = append(tags, tagInfo)
			}
		}
		for child := n.FirstChild; child != nil; child = child.NextSibling {
			f(child)
		}
	}
	f(doc)
	return tags
}

// Render renders the HTML around the tag;
// if the node is a text node, pass its
// parent node as the parameter.
func Render(n *html.Node, buf *strings.Builder) string {
	defer buf.Reset()
	if err := html.Render(buf, n); err != nil {
		log.Println(err)
		return ""
	}
	return buf.String()
}
If you want the particular URL list, here it is. I fired around 60 requests at a time.
I tried bytes.Buffer, and a sync.Pool of bytes.Buffer, but both have the same issue (a sketch of the pooled variant is below). Using pprof, I noticed that strings.Builder's WriteString method is causing the huge memory use.
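For reference, the pooled variant looked roughly like this. It is a sketch rather than the exact code I ran: it assumes the rendering helper was rewritten around *bytes.Buffer, and bufPool/RenderPooled are illustrative names.

package main

import (
	"bytes"
	"log"
	"sync"

	"golang.org/x/net/html"
)

// bufPool recycles buffers between render calls.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// RenderPooled renders n into a pooled buffer and returns the HTML.
// It showed the same memory growth as the strings.Builder version.
func RenderPooled(n *html.Node) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset()
		bufPool.Put(buf)
	}()
	if err := html.Render(buf, n); err != nil {
		log.Println(err)
		return ""
	}
	return buf.String()
}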
Answer 1
Score: 2
So the basic issue here was accepting any content-type, which is not acceptable in terms of scraping; most sites are expected to send text/html.
The problem was that even if the URL sends content which does not represent HTML data, golang.org/x/net/html still accepts it without throwing an error.
Let's take an example where application/pdf is returned: the body will then contain the binary data of the PDF, which html.Parse parses without returning any error. That is weird behaviour, considering the library is made for scraping/crawling, yet it accepts binary data.
The solution is: check the response header and proceed only if the data is HTML; otherwise there will be ambiguity or higher memory usage (it may also be lower), but we can't predict what will happen.
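A minimal sketch of that check, applied to the GetDoc function from the question (it assumes a plain prefix match on the Content-Type header is good enough; adjust the rule to your needs):

// GetDoc now rejects non-HTML responses instead of feeding
// arbitrary bytes (PDFs, images, ...) into html.Parse.
// Requires "fmt" and "strings" in the import list.
func GetDoc(u string) (*html.Node, error) {
	res, err := http.Get(u)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()

	// Match the prefix only, so "text/html; charset=utf-8" also passes.
	if ct := res.Header.Get("Content-Type"); !strings.HasPrefix(ct, "text/html") {
		return nil, fmt.Errorf("unexpected Content-Type %q for %s", ct, u)
	}
	return html.Parse(res.Body)
}

With this guard, a URL that returns a PDF or some other binary body fails fast with an error instead of being rendered tag by tag into the builder.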