Performant web-spider with no external dependencies

Question

I'm trying to write my first web spider in Golang. Its task is to crawl domains (and inspect their HTML) returned by a database query. The idea is to have no third-party dependencies (e.g. a message queue), or as few as possible, yet it has to be performant enough to crawl 5 million domains per day. I have approximately 150 million domains that I need to check every month.

The very basic version is below. It runs in an "infinite loop", since in theory the crawl process is endless.

package main

import (
    "runtime"
    "sync"
    "time"
)

func crawl(n time.Duration) {
    var wg sync.WaitGroup
    runtime.GOMAXPROCS(runtime.NumCPU())

    for range time.Tick(n * time.Second) {
        wg.Add(1)

        go func() {
            defer wg.Done()

            // do the expensive work here - query db, crawl domain, inspect html
        }()
    }
    wg.Wait()
}

func main() {
    go crawl(1)

    select {}
}

Running this code on 4 CPU cores at the moment means it can perform at most 345,600 requests during 24 hours ((60 * 60 * 24) * 4), given the threshold of 1 second. At least that's my understanding. If my thinking is correct, then I need to come up with a solution that is roughly 14x faster to meet the daily requirement.
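For reference, the arithmetic behind that estimate can be written out as a tiny program (a back-of-the-envelope sketch only; the one-request-per-second-per-worker rate is the assumption stated above):

package main

import "fmt"

func main() {
    const (
        secondsPerDay = 60 * 60 * 24 // 86,400 seconds
        workers       = 4            // one goroutine per CPU core, per the reasoning above
        perWorkerRate = 1            // assumed: one domain crawled per second per worker
        dailyTarget   = 5000000      // domains to crawl per day
    )

    current := secondsPerDay * workers * perWorkerRate
    fmt.Println("current daily capacity:", current)                                 // 345600
    fmt.Printf("required speed-up: %.1fx\n", float64(dailyTarget)/float64(current)) // ~14.5x
}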

I would appreciate your advice on how to make the crawler faster, but without resorting to a complicated stack setup or buying a server with more CPU cores.

Answer 1

Score: 2

Why have the timing component at all?

Just create a channel that you feed URLs to, then spawn N goroutines that loop over that channel and do the work.

Then just tweak the value of N until your CPU/memory utilization is capped at around 90% (to accommodate fluctuations in site response times).

Something like this (runnable on the Go Playground):

package main

import (
    "fmt"
    "sync"
)

var numWorkers = 10

// crawler drains the urls channel until it is closed.
func crawler(urls chan string, wg *sync.WaitGroup) {
    defer wg.Done()
    for u := range urls {
        fmt.Println(u) // placeholder for the real work: fetch the URL, inspect the HTML
    }
}

func main() {
    ch := make(chan string)
    var wg sync.WaitGroup
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go crawler(ch, &wg)
    }
    ch <- "http://ibm.com"
    ch <- "http://google.com"
    close(ch)
    wg.Wait()
    fmt.Println("All Done")
}
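In the real spider, the fmt.Println placeholder would be replaced by the actual work: fetching the page and inspecting the HTML. A minimal sketch of what that worker body might look like, assuming a plain net/http GET with a timeout (the timeout value, the worker count, and the inspectHTML helper are illustrative, not part of the original answer):

package main

import (
    "fmt"
    "io"
    "net/http"
    "sync"
    "time"
)

var numWorkers = 100 // I/O-bound work, so far more workers than CPU cores is fine

// A shared client with a timeout so a slow site cannot stall a worker forever.
var client = &http.Client{Timeout: 10 * time.Second}

func crawler(urls chan string, wg *sync.WaitGroup) {
    defer wg.Done()
    for u := range urls {
        resp, err := client.Get(u)
        if err != nil {
            fmt.Println("fetch failed:", u, err)
            continue
        }
        body, err := io.ReadAll(resp.Body)
        resp.Body.Close()
        if err != nil {
            fmt.Println("read failed:", u, err)
            continue
        }
        inspectHTML(u, body) // hypothetical helper: whatever HTML checks the spider needs
    }
}

// inspectHTML is a stand-in for the real analysis step.
func inspectHTML(u string, body []byte) {
    fmt.Println(u, "->", len(body), "bytes")
}

func main() {
    ch := make(chan string)
    var wg sync.WaitGroup
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go crawler(ch, &wg)
    }
    // In the real program the URLs would come from the database query.
    for _, u := range []string{"http://ibm.com", "http://google.com"} {
        ch <- u
    }
    close(ch)
    wg.Wait()
}

Since each worker spends most of its time waiting on the network, throughput scales with the number of workers rather than the number of cores; the sustained rate the question asks for is only about 58 requests per second (5,000,000 / 86,400), which a pool like this can approach by raising numWorkers while keeping per-request timeouts bounded.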
