Why does the function return early?
Question
I've just started learning Go, and have been working through the tour. The last exercise is to edit a web crawler to crawl in parallel and without repeats.

Here is the link to the exercise: http://tour.golang.org/#70

Here is the code. I only changed the Crawl and main functions, so I'll just post those to keep it neat.
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
var used = make(map[string]bool)
var urlchan = make(chan string)

func Crawl(url string, depth int, fetcher Fetcher) {
    // TODO: Fetch URLs in parallel.
    // Done: Don't fetch the same URL twice.
    // This implementation doesn't do either:
    done := make(chan bool)
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("\nfound: %s %q\n\n", url, body)
    go func() {
        for _, i := range urls {
            urlchan <- i
        }
        done <- true
    }()
    for u := range urlchan {
        if used[u] == false {
            used[u] = true
            go Crawl(u, depth-1, fetcher)
        }
        if <-done == true {
            break
        }
    }
    return
}

func main() {
    used["http://golang.org/"] = true
    Crawl("http://golang.org/", 4, fetcher)
}
The problem is that when I run the program, the crawler stops after printing:

    not found: http://golang.org/cmd/
This only happens when I try to make the program run in parallel. If I have it run linearly, then all the URLs are found correctly.

Note: If I am not doing this right (parallelism, I mean) then I apologise.
Answer 1

Score: 1
- Be careful with goroutines.

- When the main goroutine, i.e. the main() func, returns, all other goroutines are killed immediately.

- Your Crawl() looks recursive, but it is not, which means it returns immediately rather than waiting for the other Crawl() routines. And once the first Crawl(), the one called by main(), returns, main() regards its mission as fulfilled.

- What you can do is make main() wait until the last Crawl() returns; the sync package, or a chan, would help. (A minimal sync.WaitGroup sketch follows the solution below.)

- You could probably take a look at the last solution of this, which I did months ago:
var store map[string]bool

func Krawl(url string, fetcher Fetcher, Urls chan []string) {
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
    } else {
        fmt.Printf("found: %s %q\n", url, body)
    }
    Urls <- urls
}

func Crawl(url string, depth int, fetcher Fetcher) {
    Urls := make(chan []string)
    go Krawl(url, fetcher, Urls)
    band := 1         // number of Krawl goroutines still in flight
    store[url] = true // init for level 0 done
    for i := 0; i < depth; i++ {
        for band > 0 {
            band--
            next := <-Urls
            for _, url := range next {
                if _, done := store[url]; !done {
                    store[url] = true
                    band++
                    go Krawl(url, fetcher, Urls)
                }
            }
        }
    }
    return
}

func main() {
    store = make(map[string]bool)
    Crawl("http://golang.org/", 4, fetcher)
}
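For completeness, here is a minimal sketch of the sync.WaitGroup approach mentioned above. It reuses the tour's skeleton (the Fetcher interface and the fetcher variable) and assumes the fmt and sync imports; the extra wg parameter and the mutex-guarded visited map are additions of this sketch, not part of the exercise:

var (
    mu      sync.Mutex
    visited = make(map[string]bool) // shared visited set, guarded by mu
)

func Crawl(url string, depth int, fetcher Fetcher, wg *sync.WaitGroup) {
    defer wg.Done() // signal completion even on early return
    if depth <= 0 {
        return
    }
    mu.Lock()
    if visited[url] {
        mu.Unlock()
        return
    }
    visited[url] = true
    mu.Unlock()
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("found: %s %q\n", url, body)
    for _, u := range urls {
        wg.Add(1) // register the child before spawning it
        go Crawl(u, depth-1, fetcher, wg)
    }
}

func main() {
    var wg sync.WaitGroup
    wg.Add(1)
    Crawl("http://golang.org/", 4, fetcher, &wg)
    wg.Wait() // block main() until every Crawl goroutine has finished
}

The key points are that wg.Add(1) happens before each goroutine is spawned, never inside it, and that main() blocks on wg.Wait() instead of returning as soon as the first Crawl() call does.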