Performant web-spider with no external dependencies

Question

I'm trying to write my first web spider in Go. Its task is to crawl domains (and inspect their HTML) obtained from a database query. The idea is to have no third-party dependencies (e.g. a message queue), or as few as possible, yet the spider has to be performant enough to crawl 5 million domains per day. I have roughly 150 million domains that I need to check every month.

The very basic version is below. It runs in an "infinite loop", since in theory the crawl process is endless.

func crawl(n time.Duration) {
    var wg sync.WaitGroup
    runtime.GOMAXPROCS(runtime.NumCPU())

    for _ = range time.Tick(n * time.Second) {
        wg.Add(1)

        go func() {
            defer wg.Done()

            // do the expensive work here - query the DB, crawl the domain, inspect the HTML
        }()
    }
    wg.Wait()
}

func main() {
    go crawl(1)

    select{}
}

Running this code on 4 CPU cores means it can perform at most 345,600 requests in 24 hours ((60 * 60 * 24) * 4), given the 1-second tick. At least that's my understanding. If my thinking is correct, I need to come up with a solution that is roughly 14x faster to meet the daily requirement.
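For reference, here is a rough back-of-envelope sketch of what the 5-million-per-day target implies, assuming each crawl takes about one second on average (an assumption, not a measured number):

package main

import "fmt"

func main() {
    const domainsPerDay = 5000000 // daily target from the question
    const avgCrawlSecs = 1.0      // assumed average time to crawl one domain

    reqPerSec := float64(domainsPerDay) / (24 * 60 * 60) // ≈ 58 requests per second
    workers := reqPerSec * avgCrawlSecs                  // ≈ 58 crawls in flight at once

    fmt.Printf("~%.0f req/s, ~%.0f concurrent crawls needed\n", reqPerSec, workers)
}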

I would appreciate advice on how to make the crawler faster, without resorting to a complicated stack setup or buying a server with more CPU cores.

Answer 1

Score: 2

Why have the timing component at all?

Just create a channel that you feed URLs into, then spawn N goroutines that loop over that channel and do the work.

Then tweak the value of N until your CPU/memory is at roughly 90% utilization (to accommodate fluctuations in site response times).

Something like this (runnable on the Go Playground):

package main

import (
    "fmt"
    "sync"
)

var numWorkers = 10

// crawler consumes URLs from the channel until it is closed.
func crawler(urls chan string, wg *sync.WaitGroup) {
    defer wg.Done()
    for u := range urls {
        fmt.Println(u) // do the expensive crawl/inspect work here
    }
}

func main() {
    ch := make(chan string)
    var wg sync.WaitGroup

    // start a fixed pool of workers
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go crawler(ch, &wg)
    }

    // feed URLs to the pool
    ch <- "http://ibm.com"
    ch <- "http://google.com"

    close(ch) // no more URLs; the workers' range loops end
    wg.Wait()
    fmt.Println("All Done")
}
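The example above only prints each URL; in the real crawler the body of that loop is where the fetch and HTML inspection happen. A minimal sketch of that step, assuming a plain net/http GET with a timeout (not part of the original answer):

package main

import (
    "fmt"
    "io"
    "net/http"
    "time"
)

// A client with a timeout so that one slow site cannot stall a worker forever.
var client = &http.Client{Timeout: 10 * time.Second}

// fetch downloads the page body for a single URL.
func fetch(url string) ([]byte, error) {
    resp, err := client.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    return io.ReadAll(resp.Body)
}

func main() {
    body, err := fetch("http://google.com")
    if err != nil {
        fmt.Println("fetch failed:", err)
        return
    }
    fmt.Println("fetched", len(body), "bytes") // inspect the HTML here
}

Plugging fetch(u) in place of fmt.Println(u) inside the worker loop is all the pattern needs.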

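The question also says the domains come from a database query, so something has to feed the channel. A hedged sketch of a producer goroutine doing that with database/sql, assuming a Postgres driver (github.com/lib/pq) and a hypothetical domains table; the connection string and query are placeholders:

package main

import (
    "database/sql"
    "log"

    _ "github.com/lib/pq" // assumed driver; any database/sql driver works
)

// feedDomains streams domains from the database into the workers' channel.
func feedDomains(db *sql.DB, urls chan<- string) error {
    defer close(urls) // closing the channel lets the workers exit their range loops

    rows, err := db.Query("SELECT domain FROM domains") // hypothetical table/query
    if err != nil {
        return err
    }
    defer rows.Close()

    for rows.Next() {
        var domain string
        if err := rows.Scan(&domain); err != nil {
            return err
        }
        urls <- "http://" + domain
    }
    return rows.Err()
}

func main() {
    db, err := sql.Open("postgres", "postgres://user:pass@localhost/crawler?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    urls := make(chan string)
    go func() {
        if err := feedDomains(db, urls); err != nil {
            log.Println("feed error:", err)
        }
    }()

    // the worker pool from the answer would range over urls here
    for u := range urls {
        log.Println("would crawl:", u)
    }
}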

