Goroutine in for loop causes unexpected behavior


I was doing the Web Crawler Exercise in A Tour of Go.

I tried to solve it concurrently using a Mutex, based on a solution found here. I modified it to fit the pre-defined signatures in the original question. However, the crawler stops at the second level of the URL tree. While debugging, the differing behavior of the print statements completely confused me:

    var done sync.WaitGroup
    for _, u := range urls {
        done.Add(1)
        fmt.Printf("enter: %s\n", u) // here
        go func(url string) {
            defer done.Done()
            Crawl(u, depth-1, fetcher, f)
        }(u)
    }
    done.Wait()

If I put the print statement outside the goroutine, the output is as expected, though I didn't know why the crawl stops there:

    enter: https://golang.org/pkg/
    enter: https://golang.org/cmd/

But if I put the print statement inside the goroutine, that is:

    var done sync.WaitGroup
    for _, u := range urls {
        done.Add(1)
        go func(url string) {
            defer done.Done()
            fmt.Printf("enter: %s\n", u) // here
            Crawl(u, depth-1, fetcher, f)
        }(u)
    }
    done.Wait()

The output becomes

    enter: https://golang.org/cmd/
    enter: https://golang.org/cmd/

I have two questions:

  1. In the second case, why does enter: https://golang.org/cmd/ get printed twice?
  2. Why does the Crawl function stop at an error, instead of continuing to traverse the URL tree?

PS: the second question might be related to the first one. I intentionally used u instead of url inside the goroutine to reproduce the bug that confused me.

Below is my modified solution:

    package main

    import (
        "fmt"
        "sync"
    )

    type Fetcher interface {
        // Fetch returns the body of URL and
        // a slice of URLs found on that page.
        Fetch(url string) (body string, urls []string, err error)
    }

    type fetchState struct {
        mu      sync.Mutex
        fetched map[string]bool
    }

    // Crawl uses fetcher to recursively crawl
    // pages starting with url, to a maximum of depth.
    func Crawl(url string, depth int, fetcher Fetcher, f *fetchState) {
        f.mu.Lock()
        already := f.fetched[url]
        f.fetched[url] = true
        f.mu.Unlock()
        if already {
            return
        }
        if depth <= 0 {
            return
        }
        body, urls, err := fetcher.Fetch(url)
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Printf("found: %s %q\n", url, body)
        var done sync.WaitGroup
        for _, u := range urls {
            done.Add(1)
            go func(url string) {
                defer done.Done()
                fmt.Printf("enter: %s\n", u)
                Crawl(u, depth-1, fetcher, f)
            }(u)
        }
        done.Wait()
    }

    func makeState() *fetchState {
        f := &fetchState{}
        f.fetched = make(map[string]bool)
        return f
    }

    func main() {
        Crawl("https://golang.org/", 4, fetcher, makeState())
    }

    // fakeFetcher is a Fetcher that returns canned results.
    type fakeFetcher map[string]*fakeResult

    type fakeResult struct {
        body string
        urls []string
    }

    func (f fakeFetcher) Fetch(url string) (string, []string, error) {
        if res, ok := f[url]; ok {
            return res.body, res.urls, nil
        }
        return "", nil, fmt.Errorf("not found: %s", url)
    }

    // fetcher is a populated fakeFetcher.
    var fetcher = fakeFetcher{
        "https://golang.org/": &fakeResult{
            "The Go Programming Language",
            []string{
                "https://golang.org/pkg/",
                "https://golang.org/cmd/",
            },
        },
        "https://golang.org/pkg/": &fakeResult{
            "Packages",
            []string{
                "https://golang.org/",
                "https://golang.org/cmd/",
                "https://golang.org/pkg/fmt/",
                "https://golang.org/pkg/os/",
            },
        },
        "https://golang.org/pkg/fmt/": &fakeResult{
            "Package fmt",
            []string{
                "https://golang.org/",
                "https://golang.org/pkg/",
            },
        },
        "https://golang.org/pkg/os/": &fakeResult{
            "Package os",
            []string{
                "https://golang.org/",
                "https://golang.org/pkg/",
            },
        },
    }

Answer 1

Score: 2


Welcome to Stack Overflow!

In your function, you defined url as the parameter, but kept using u inside it. The loop variable u is captured by the func literal.

Try doing this:

    var done sync.WaitGroup
    for _, u := range urls {
        done.Add(1)
        go func(url string) {
            defer done.Done()
            fmt.Printf("enter: %s\n", url) // <- check the difference
            Crawl(url, depth-1, fetcher, f) // <- check the difference
        }(u)
    }
    done.Wait()
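An equivalent fix, not shown in the answer, is the `u := u` shadowing idiom: declaring a fresh per-iteration copy that the closure captures instead of the shared loop variable. (Since Go 1.22, the loop variable is per-iteration anyway, so neither workaround is needed there.) Here is a minimal runnable sketch under that assumption; `collect` is a made-up helper name for illustration:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collect launches one goroutine per URL, using the u := u shadowing
// idiom so each closure captures its own copy of the loop variable.
func collect(urls []string) []string {
	var done sync.WaitGroup
	var mu sync.Mutex
	var seen []string
	for _, u := range urls {
		u := u // per-iteration copy; implicit since Go 1.22
		done.Add(1)
		go func() {
			defer done.Done()
			mu.Lock()
			seen = append(seen, u)
			mu.Unlock()
		}()
	}
	done.Wait()
	sort.Strings(seen) // sort so the result is deterministic
	return seen
}

func main() {
	fmt.Println(collect([]string{"https://golang.org/pkg/", "https://golang.org/cmd/"}))
	// [https://golang.org/cmd/ https://golang.org/pkg/]
}
```

Passing the value as an argument (as in the answer) and shadowing it both work; the important point is that each goroutine must get its own copy before the loop advances.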

As for why the same value was printed via the u variable, this is a very common mistake: https://github.com/golang/go/wiki/CommonMistakes#using-goroutines-on-loop-iterator-variables

In short, Go closes over a single variable that is shared by reference with all the goroutines. By the time they execute, they will probably find the last value of the iteration in it.

I found this neat article that explains it in detail: https://eli.thegreenplace.net/2019/go-internals-capturing-loop-variables-in-closures/

huangapple
  • Published on 2022-08-26 05:47:39
  • Original link: https://go.coder-hub.com/73493998.html