Go doesn't release memory after http.Get
Question
I am loading web pages with a simple worker pool while reading URLs from a file on the fly. But this small program slowly allocates as much memory as my server has, until the OOM killer stops it. It looks like resp.Body.Close() doesn't free the memory used for the body text (memory usage ≈ downloaded pages × average page size). How can I force Go to free the memory allocated for the body HTML text?
package main

import (
    "bufio"
    "fmt"
    "io/ioutil"
    "net/http"
    "os"
    "strings"
    "sync"
)

func worker(linkChan chan string, wg *sync.WaitGroup) {
    defer wg.Done()
    for url := range linkChan {
        // Getting body text
        resp, err := http.Get(url)
        if err != nil {
            fmt.Printf("Fail url: %s\n", url)
            continue
        }
        body, err := ioutil.ReadAll(resp.Body)
        resp.Body.Close()
        if err != nil {
            fmt.Printf("Fail url: %s\n", url)
            continue
        }
        // Test page body
        has_rem_code := strings.Contains(string(body), "googleadservices.com/pagead/conversion.js")
        fmt.Printf("Done url: %s\t%t\n", url, has_rem_code)
    }
}

func main() {
    // Creating worker pool
    lCh := make(chan string, 30)
    wg := new(sync.WaitGroup)
    for i := 0; i < 30; i++ {
        wg.Add(1)
        go worker(lCh, wg)
    }
    // Opening file with urls
    file, err := os.Open("./tmp/new.csv")
    if err != nil {
        panic(err)
    }
    defer file.Close()
    reader := bufio.NewReader(file)
    // Processing urls
    for href, _, err := reader.ReadLine(); err == nil; href, _, err = reader.ReadLine() {
        lCh <- string(href)
    }
    close(lCh)
    wg.Wait()
}
Here is some output from the pprof tool:
flat flat% sum% cum cum%
34.63MB 29.39% 29.39% 34.63MB 29.39% bufio.NewReaderSize
30MB 25.46% 54.84% 30MB 25.46% net/http.(*Transport).getIdleConnCh
23.09MB 19.59% 74.44% 23.09MB 19.59% bufio.NewWriter
11.63MB 9.87% 84.30% 11.63MB 9.87% net/http.(*Transport).putIdleConn
6.50MB 5.52% 89.82% 6.50MB 5.52% main.main
Looks like this issue (https://github.com/golang/go/issues/5794), but that was fixed two years ago.
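The question does not show how the profile above was collected. As a minimal sketch, assuming the net/http/pprof endpoint approach (not necessarily what the author actually used, and it also needs the log import and a blank import of net/http/pprof), the program above could expose live profiling data like this:

// Added near the top of main(), before the workers start.
// net/http/pprof's init registers the /debug/pprof/* handlers on http.DefaultServeMux.
go func() {
    log.Println(http.ListenAndServe("localhost:6060", nil))
}()

Once that endpoint is running, go tool pprof http://localhost:6060/debug/pprof/heap fetches a heap profile, and its top command prints a flat/cum listing like the one shown above.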
Answer 1
Score: 5
Found the answer in this thread on golang-nuts. http.Transport saves connections for future reuse when later requests go to the same host, which caused the memory bloat in my case (hundreds of thousands of different hosts). Disabling keep-alives completely solves the problem.
Working code:
func worker(linkChan chan string, wg *sync.WaitGroup) {
    defer wg.Done()
    var transport http.RoundTripper = &http.Transport{
        DisableKeepAlives: true,
    }
    c := &http.Client{Transport: transport}
    for url := range linkChan {
        // Getting body text
        resp, err := c.Get(url)
        if err != nil {
            fmt.Printf("Fail url: %s\n", url)
            continue
        }
        body, err := ioutil.ReadAll(resp.Body)
        resp.Body.Close()
        if err != nil {
            fmt.Printf("Fail url: %s\n", url)
            continue
        }
        // Test page body
        has_rem_code := strings.Contains(string(body), "googleadservices.com/pagead/conversion.js")
        fmt.Printf("Done url: %s\t%t\n", url, has_rem_code)
    }
}
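One further note, not part of the original answer: http.Client is documented as safe for concurrent use by multiple goroutines, so the keep-alive-disabled transport could also be built once in main and shared by all 30 workers instead of being recreated inside every goroutine. A minimal sketch of that variant (the worker gains a *http.Client parameter):

// Hypothetical variant: one shared client for every worker.
func worker(linkChan chan string, wg *sync.WaitGroup, c *http.Client) {
    defer wg.Done()
    for url := range linkChan {
        resp, err := c.Get(url) // same shared client, still no keep-alives
        if err != nil {
            fmt.Printf("Fail url: %s\n", url)
            continue
        }
        body, err := ioutil.ReadAll(resp.Body)
        resp.Body.Close()
        if err != nil {
            fmt.Printf("Fail url: %s\n", url)
            continue
        }
        has_rem_code := strings.Contains(string(body), "googleadservices.com/pagead/conversion.js")
        fmt.Printf("Done url: %s\t%t\n", url, has_rem_code)
    }
}

// In main, before starting the workers:
c := &http.Client{Transport: &http.Transport{DisableKeepAlives: true}}
for i := 0; i < 30; i++ {
    wg.Add(1)
    go worker(lCh, wg, c)
}

With keep-alives disabled, no idle connections are cached either way; the shared client merely avoids building 30 identical Transport values.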