Performant web-spider with no external dependencies
Question
I'm trying to write my first web spider in Golang. Its task is to crawl domains (and inspect their HTML) coming from a provided database query. The idea is to have no third-party dependencies (e.g. a message queue), or as few as possible, yet it has to be performant enough to crawl 5 million domains per day. I have approximately 150 million domains that I need to check every month.
The very basic version is below. It runs in an "infinite loop", since in theory the crawl process would be endless.
func crawl(n time.Duration) {
    var wg sync.WaitGroup
    runtime.GOMAXPROCS(runtime.NumCPU())

    for _ = range time.Tick(n * time.Second) {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // do the expensive work here - query db, crawl domain, inspect html
        }()
    }
    wg.Wait()
}

func main() {
    go crawl(1)
    select {}
}
Running this code on 4 CPU cores means it can perform at most 345,600 requests in 24 hours ((60 * 60 * 24) * 4), given the threshold of 1 second. At least that's my understanding. If my thinking is correct, I will need to come up with a solution that is 14x faster to meet the daily requirement.
I would appreciate your advice on how to make the crawler faster, but without resorting to a complicated stack setup or buying a server with more CPU cores.
Answer 1
Score: 2
Why have the timing component at all?
Just create a channel that you feed URLs into, then spawn N goroutines that loop over that channel and do the work.
Then just tweak the value of N until your CPU/memory is capped at ~90% utilization (to accommodate fluctuations in site response times).
Something like this (on Play):
package main

import (
    "fmt"
    "sync"
)

var numWorkers = 10

// crawler drains the urls channel until it is closed.
func crawler(urls chan string, wg *sync.WaitGroup) {
    defer wg.Done()
    for u := range urls {
        fmt.Println(u) // do the real work (fetch, inspect HTML) here
    }
}

func main() {
    ch := make(chan string)
    var wg sync.WaitGroup

    // start the worker pool
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go crawler(ch, &wg)
    }

    // feed work to the pool, then signal that no more is coming
    ch <- "http://ibm.com"
    ch <- "http://google.com"
    close(ch)

    wg.Wait()
    fmt.Println("All Done")
}
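As a rough scale check, 5 million domains per day is about 58 per second, so if an average fetch takes around one second you would need on the order of 60 to 100 workers, plus headroom for slow sites. Below is a minimal sketch of what the worker could look like once it actually fetches pages instead of printing URLs. It is only an illustration of the pattern, not the asker's implementation: the 10-second timeout, the numWorkers value, and the placeholder producer loop are assumptions, and the database feeding and HTML inspection are left as comments.

package main

import (
    "fmt"
    "io"
    "net/http"
    "sync"
    "time"
)

var numWorkers = 100 // assumption: tune until CPU/RAM/bandwidth sit near ~90%

// shared client with a timeout so one slow site cannot stall a worker forever
var client = &http.Client{Timeout: 10 * time.Second}

func crawler(urls chan string, wg *sync.WaitGroup) {
    defer wg.Done()
    for u := range urls {
        resp, err := client.Get(u)
        if err != nil {
            fmt.Println(u, "error:", err)
            continue
        }
        body, err := io.ReadAll(resp.Body) // inspect the HTML here instead
        resp.Body.Close()
        if err != nil {
            fmt.Println(u, "read error:", err)
            continue
        }
        fmt.Println(u, len(body), "bytes")
    }
}

func main() {
    ch := make(chan string)
    var wg sync.WaitGroup

    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go crawler(ch, &wg)
    }

    // in the real spider the producer would stream domains from the DB query
    for _, u := range []string{"http://ibm.com", "http://google.com"} {
        ch <- u
    }
    close(ch)

    wg.Wait()
    fmt.Println("All Done")
}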