Go doesn't release memory after http.Get
Question
I am loading web pages with a simple worker pool while reading URLs from a file on the fly. But this small program slowly allocates as much memory as my server has, until the OOM killer stops it. It looks like resp.Body.Close() doesn't free the memory used for the body text (memory usage ≈ downloaded pages × average page size). How can I force Go to free the memory allocated for the body HTML text?
package main

import (
    "bufio"
    "fmt"
    "io/ioutil"
    "net/http"
    "os"
    "strings"
    "sync"
)

func worker(linkChan chan string, wg *sync.WaitGroup) {
    defer wg.Done()
    for url := range linkChan {
        // Getting body text
        resp, err := http.Get(url)
        if err != nil {
            fmt.Printf("Fail url: %s\n", url)
            continue
        }
        body, err := ioutil.ReadAll(resp.Body)
        resp.Body.Close()
        if err != nil {
            fmt.Printf("Fail url: %s\n", url)
            continue
        }
        // Test page body
        has_rem_code := strings.Contains(string(body), "googleadservices.com/pagead/conversion.js")
        fmt.Printf("Done url: %s\t%t\n", url, has_rem_code)
    }
}

func main() {
    // Creating worker pool
    lCh := make(chan string, 30)
    wg := new(sync.WaitGroup)
    for i := 0; i < 30; i++ {
        wg.Add(1)
        go worker(lCh, wg)
    }
    // Opening file with urls
    file, err := os.Open("./tmp/new.csv")
    if err != nil {
        panic(err)
    }
    defer file.Close()
    reader := bufio.NewReader(file)
    // Processing urls
    for href, _, err := reader.ReadLine(); err == nil; href, _, err = reader.ReadLine() {
        lCh <- string(href)
    }
    close(lCh)
    wg.Wait()
}
Here is some output from the pprof tool:
flat flat% sum% cum cum%
34.63MB 29.39% 29.39% 34.63MB 29.39% bufio.NewReaderSize
30MB 25.46% 54.84% 30MB 25.46% net/http.(*Transport).getIdleConnCh
23.09MB 19.59% 74.44% 23.09MB 19.59% bufio.NewWriter
11.63MB 9.87% 84.30% 11.63MB 9.87% net/http.(*Transport).putIdleConn
6.50MB 5.52% 89.82% 6.50MB 5.52% main.main
Looks like this issue (https://github.com/golang/go/issues/5794), but that was fixed two years ago.
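The question does not show how the profile above was collected. As a minimal sketch, assuming the net/http/pprof endpoint approach (not necessarily what the author actually used, and it also needs the log import and a blank import of net/http/pprof), the program above could expose live profiling data like this:

// Added near the top of main(), before the workers start.
// net/http/pprof's init registers the /debug/pprof/* handlers on http.DefaultServeMux.
go func() {
    log.Println(http.ListenAndServe("localhost:6060", nil))
}()

Once that endpoint is running, go tool pprof http://localhost:6060/debug/pprof/heap fetches a heap profile, and its top command prints a flat/cum listing like the one shown above.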
Answer 1
Score: 5
Found the answer in this thread on golang-nuts. http.Transport saves connections for future reuse when later requests go to the same host, which caused the memory bloat in my case (hundreds of thousands of different hosts). Disabling keep-alives completely solves the problem.
Working code:
func worker(linkChan chan string, wg *sync.WaitGroup) {
    defer wg.Done()
    var transport http.RoundTripper = &http.Transport{
        DisableKeepAlives: true,
    }
    c := &http.Client{Transport: transport}
    for url := range linkChan {
        // Getting body text
        resp, err := c.Get(url)
        if err != nil {
            fmt.Printf("Fail url: %s\n", url)
            continue
        }
        body, err := ioutil.ReadAll(resp.Body)
        resp.Body.Close()
        if err != nil {
            fmt.Printf("Fail url: %s\n", url)
            continue
        }
        // Test page body
        has_rem_code := strings.Contains(string(body), "googleadservices.com/pagead/conversion.js")
        fmt.Printf("Done url: %s\t%t\n", url, has_rem_code)
    }
}
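One further note, not part of the original answer: http.Client is documented as safe for concurrent use by multiple goroutines, so the keep-alive-disabled transport could also be built once in main and shared by all 30 workers instead of being recreated inside every goroutine. A minimal sketch of that variant (the worker gains a *http.Client parameter):

// Hypothetical variant: one shared client for every worker.
func worker(linkChan chan string, wg *sync.WaitGroup, c *http.Client) {
    defer wg.Done()
    for url := range linkChan {
        resp, err := c.Get(url) // same shared client, still no keep-alives
        if err != nil {
            fmt.Printf("Fail url: %s\n", url)
            continue
        }
        body, err := ioutil.ReadAll(resp.Body)
        resp.Body.Close()
        if err != nil {
            fmt.Printf("Fail url: %s\n", url)
            continue
        }
        has_rem_code := strings.Contains(string(body), "googleadservices.com/pagead/conversion.js")
        fmt.Printf("Done url: %s\t%t\n", url, has_rem_code)
    }
}

// In main, before starting the workers:
c := &http.Client{Transport: &http.Transport{DisableKeepAlives: true}}
for i := 0; i < 30; i++ {
    wg.Add(1)
    go worker(lCh, wg, c)
}

With keep-alives disabled, no idle connections are cached either way; the shared client merely avoids building 30 identical Transport values.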