Why does the function return early?

Question


I've just started learning go, and have been working through the tour. The last exercise is to edit a web crawler to crawl in parallel and without repeats.

Here is the link to the exercise: http://tour.golang.org/#70

Here is the code. I only changed the crawl and the main function. So I'll just post those to keep it neat.

    // Crawl uses fetcher to recursively crawl
    // pages starting with url, to a maximum of depth.
    var used = make(map[string]bool)
    var urlchan = make(chan string)

    func Crawl(url string, depth int, fetcher Fetcher) {
        // TODO: Fetch URLs in parallel.
        // Done: Don't fetch the same URL twice.
        // This implementation doesn't do either:
        done := make(chan bool)
        if depth <= 0 {
            return
        }
        body, urls, err := fetcher.Fetch(url)
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Printf("\nfound: %s %q\n\n", url, body)
        go func() {
            for _, i := range urls {
                urlchan <- i
            }
            done <- true
        }()
        for u := range urlchan {
            if used[u] == false {
                used[u] = true
                go Crawl(u, depth-1, fetcher)
            }
            if <-done == true {
                break
            }
        }
        return
    }

    func main() {
        used["http://golang.org/"] = true
        Crawl("http://golang.org/", 4, fetcher)
    }

The problem is that when I run the program, the crawler stops after printing

    not found: http://golang.org/cmd/

This only happens when I try to make the program run in parallel. If I run it sequentially, all the URLs are found correctly.

Note: if I am not doing this right (the parallelism, I mean), then I apologise.

Answer 1

Score: 1

  • Be careful with goroutines.

  • When the main goroutine, i.e. the main() function, returns, all other goroutines are killed immediately.

  • Your Crawl() looks recursive, but it effectively is not: it launches the recursive calls in new goroutines and returns right away instead of waiting for them. And once the first Crawl(), the one called by main(), returns, main() regards its mission as fulfilled and exits.

  • What you can do is make main() wait until the last Crawl() returns. The sync package, or a chan, can help with that (see the WaitGroup sketch after the code below).

  • You could take a look at the last solution of this, which I wrote months ago:

      var store map[string]bool

      func Krawl(url string, fetcher Fetcher, Urls chan []string) {
          body, urls, err := fetcher.Fetch(url)
          if err != nil {
              fmt.Println(err)
          } else {
              fmt.Printf("found: %s %q\n", url, body)
          }
          Urls <- urls
      }

      func Crawl(url string, depth int, fetcher Fetcher) {
          Urls := make(chan []string)
          go Krawl(url, fetcher, Urls)
          band := 1          // number of Krawl goroutines still in flight
          store[url] = true  // init for level 0 done
          for i := 0; i < depth; i++ {
              for band > 0 {
                  band--
                  next := <-Urls
                  for _, url := range next {
                      if _, done := store[url]; !done {
                          store[url] = true
                          band++
                          go Krawl(url, fetcher, Urls)
                      }
                  }
              }
          }
          return
      }

      func main() {
          store = make(map[string]bool)
          Crawl("http://golang.org/", 4, fetcher)
      }
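
As a quick, standalone illustration of the point about main() above (my own sketch, not part of the original answer): a Go program exits as soon as main() returns, discarding any goroutines that have not finished.

    package main

    import "fmt"

    func main() {
        go func() {
            fmt.Println("this line may never print")
        }()
        // main returns immediately; the goroutine above is killed
        // along with the program, so the Println usually never runs.
    }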
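
And here is a minimal sketch of the sync-based route the answer mentions, using sync.WaitGroup. It assumes the Fetcher interface and the fetcher value from the tour exercise, plus imports of fmt and sync; it illustrates the idea rather than being the answerer's code.

    var (
        visited = make(map[string]bool) // URLs already claimed
        mu      sync.Mutex              // guards visited
        wg      sync.WaitGroup          // counts outstanding Crawl calls
    )

    func Crawl(url string, depth int, fetcher Fetcher) {
        defer wg.Done()
        if depth <= 0 {
            return
        }
        body, urls, err := fetcher.Fetch(url)
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Printf("found: %s %q\n", url, body)
        for _, u := range urls {
            mu.Lock()
            seen := visited[u]
            visited[u] = true
            mu.Unlock()
            if !seen {
                wg.Add(1) // register the child before spawning it
                go Crawl(u, depth-1, fetcher)
            }
        }
    }

    func main() {
        visited["http://golang.org/"] = true
        wg.Add(1)
        go Crawl("http://golang.org/", 4, fetcher)
        wg.Wait() // block main until every Crawl goroutine is done
    }

The key difference from the question's version is that wg.Wait() keeps main() alive until the counter of registered Crawl calls drains to zero, so no goroutine is cut off mid-fetch.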
