Why does the function return early?
Question
I've just started learning Go, and have been working through the tour. The last exercise is to edit a web crawler to crawl in parallel and without repeats.

Here is the link to the exercise: http://tour.golang.org/#70

Here is the code. I only changed the Crawl and main functions, so I'll just post those to keep it neat.
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
var used = make(map[string]bool)
var urlchan = make(chan string)

func Crawl(url string, depth int, fetcher Fetcher) {
    // TODO: Fetch URLs in parallel.
    // Done: Don't fetch the same URL twice.
    // This implementation doesn't do either:
    done := make(chan bool)
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("\nfound: %s %q\n\n", url, body)
    go func() {
        for _, i := range urls {
            urlchan <- i
        }
        done <- true
    }()
    for u := range urlchan {
        if used[u] == false {
            used[u] = true
            go Crawl(u, depth-1, fetcher)
        }
        if <-done == true {
            break
        }
    }
    return
}

func main() {
    used["http://golang.org/"] = true
    Crawl("http://golang.org/", 4, fetcher)
}
The problem is that when I run the program, the crawler stops after printing:

    not found: http://golang.org/cmd/
This only happens when I try to make the program run in parallel. If I have it run linearly, then all the URLs are found correctly.

Note: If I am not doing this right (parallelism, I mean) then I apologise.
Answer 1

Score: 1
- Be careful with goroutines.

- When the main goroutine, i.e. the main() func, returns, all other goroutines are killed immediately.

- Your Crawl() looks recursive, but it is not, which means it returns immediately rather than waiting for the other Crawl() routines. And once the first Crawl(), the one called by main(), returns, main() regards its mission as fulfilled.

- What you can do is make main() wait until the last Crawl() returns; the sync package, or a chan, would help. (A minimal sync.WaitGroup sketch follows the solution below.)

- You could probably take a look at the last solution of this, which I did months ago:
var store map[string]bool

func Krawl(url string, fetcher Fetcher, Urls chan []string) {
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
    } else {
        fmt.Printf("found: %s %q\n", url, body)
    }
    Urls <- urls
}

func Crawl(url string, depth int, fetcher Fetcher) {
    Urls := make(chan []string)
    go Krawl(url, fetcher, Urls)
    band := 1         // number of Krawl goroutines still in flight
    store[url] = true // init for level 0 done
    for i := 0; i < depth; i++ {
        for band > 0 {
            band--
            next := <-Urls
            for _, url := range next {
                if _, done := store[url]; !done {
                    store[url] = true
                    band++
                    go Krawl(url, fetcher, Urls)
                }
            }
        }
    }
    return
}

func main() {
    store = make(map[string]bool)
    Crawl("http://golang.org/", 4, fetcher)
}
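For completeness, here is a minimal sketch of the sync.WaitGroup approach mentioned above. It reuses the tour's skeleton (the Fetcher interface and the fetcher variable) and assumes the fmt and sync imports; the extra wg parameter and the mutex-guarded visited map are additions of this sketch, not part of the exercise:

var (
    mu      sync.Mutex
    visited = make(map[string]bool) // shared visited set, guarded by mu
)

func Crawl(url string, depth int, fetcher Fetcher, wg *sync.WaitGroup) {
    defer wg.Done() // signal completion even on early return
    if depth <= 0 {
        return
    }
    mu.Lock()
    if visited[url] {
        mu.Unlock()
        return
    }
    visited[url] = true
    mu.Unlock()
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("found: %s %q\n", url, body)
    for _, u := range urls {
        wg.Add(1) // register the child before spawning it
        go Crawl(u, depth-1, fetcher, wg)
    }
}

func main() {
    var wg sync.WaitGroup
    wg.Add(1)
    Crawl("http://golang.org/", 4, fetcher, &wg)
    wg.Wait() // block main() until every Crawl goroutine has finished
}

The key points are that wg.Add(1) happens before each goroutine is spawned, never inside it, and that main() blocks on wg.Wait() instead of returning as soon as the first Crawl() call does.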