Goroutine in for loop causes unexpected behavior

Question

I was doing the Web Crawler Exercise in A Tour of Go.

I tried to solve it concurrently with a sync.Mutex, based on a solution found here, which I modified to fit the predefined signatures in the original exercise. However, the crawler stops at the second level of the URL tree. While debugging, I was completely confused by how the output changes depending on where the print statement is placed:

	var done sync.WaitGroup
	for _, u := range urls {
		done.Add(1)
		fmt.Printf("enter: %s\n", u) // here
		go func(url string) {
			defer done.Done()
			Crawl(u, depth-1, fetcher, f)
		}(u)
	}
	done.Wait()

If I put the print statement outside the goroutine, the output is as expected, but I don't know why the crawler stops there.

enter: https://golang.org/pkg/
enter: https://golang.org/cmd/

But if I put the print statement inside the goroutine, that is

	var done sync.WaitGroup
	for _, u := range urls {
		done.Add(1)
		go func(url string) {
			defer done.Done()
			fmt.Printf("enter: %s\n", u) // here
			Crawl(u, depth-1, fetcher, f)
		}(u)
	}
	done.Wait()

The output becomes

enter: https://golang.org/cmd/
enter: https://golang.org/cmd/

I have two questions:

  1. In the second case, why does enter: https://golang.org/cmd/ get printed twice?
  2. Why does the Crawl function stop at an error instead of continuing to traverse the URL tree?

PS: The second question might be related to the first one. I intentionally used u instead of url inside the goroutine to reproduce the bug that confused me.

Below is my modified solution:

package main

import (
	"fmt"
	"sync"
)

type Fetcher interface {
	// Fetch returns the body of URL and
	// a slice of URLs found on that page.
	Fetch(url string) (body string, urls []string, err error)
}

type fetchState struct {
	mu      sync.Mutex
	fetched map[string]bool
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, f *fetchState) {
	// TODO: Fetch URLs in parallel.
	// TODO: Don't fetch the same URL twice.
	// This implementation doesn't do either:
	f.mu.Lock()
	already := f.fetched[url]
	f.fetched[url] = true
	f.mu.Unlock()
	
	if already {
		return
	}
	
	if depth <= 0 {
		return
	}
	body, urls, err := fetcher.Fetch(url)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("found: %s %q\n", url, body)
	
	var done sync.WaitGroup
	for _, u := range urls {
		done.Add(1)
		go func(url string) {
			defer done.Done()
			fmt.Printf("enter: %s\n", u)
			Crawl(u, depth-1, fetcher, f)
		}(u)
	}
	done.Wait()
	
	return
}

func makeState() *fetchState {
	f := &fetchState{}
	f.fetched = make(map[string]bool)
	return f
}

func main() {
	Crawl("https://golang.org/", 4, fetcher, makeState())
}

// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
	body string
	urls []string
}

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
	if res, ok := f[url]; ok {
		return res.body, res.urls, nil
	}
	return "", nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
	"https://golang.org/": &fakeResult{
		"The Go Programming Language",
		[]string{
			"https://golang.org/pkg/",
			"https://golang.org/cmd/",
		},
	},
	"https://golang.org/pkg/": &fakeResult{
		"Packages",
		[]string{
			"https://golang.org/",
			"https://golang.org/cmd/",
			"https://golang.org/pkg/fmt/",
			"https://golang.org/pkg/os/",
		},
	},
	"https://golang.org/pkg/fmt/": &fakeResult{
		"Package fmt",
		[]string{
			"https://golang.org/",
			"https://golang.org/pkg/",
		},
	},
	"https://golang.org/pkg/os/": &fakeResult{
		"Package os",
		[]string{
			"https://golang.org/",
			"https://golang.org/pkg/",
		},
	},
}

Answer 1

Score: 2

Welcome to Stack Overflow!

In your function literal, you declared url as the parameter but kept using u inside it. As a result, the loop variable u is captured by the func literal.

Try doing this:

	var done sync.WaitGroup
	for _, u := range urls {
		done.Add(1)
		go func(url string) {
			defer done.Done()
			fmt.Printf("enter: %s\n", url)  // <- check the difference
			Crawl(url, depth-1, fetcher, f) // <- check the difference
		}(u)
	}
	done.Wait()

As for why the same value was printed when using the u variable, this is a very common mistake: https://github.com/golang/go/wiki/CommonMistakes#using-goroutines-on-loop-iterator-variables

In short, the func literal captures the single loop variable u by reference, so every goroutine shares that one variable. By the time the goroutines actually run, the loop has probably moved on, and they all see the value from the last iteration.
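To make the fix concrete, here is a minimal sketch that is separate from the crawler. The function name visitAll and the seen slice are made up for illustration. It passes each item to the goroutine as an argument, so every goroutine works on its own copy. Declaring u := u at the top of the loop body should have the same effect. Note that, as far as I know, Go 1.22 and later create a fresh loop variable per iteration, so the original symptom may not reproduce on newer toolchains.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// visitAll (hypothetical helper) starts one goroutine per item and records
// the value each goroutine observed. The item is passed as an argument,
// so each goroutine gets its own copy instead of sharing the loop variable.
func visitAll(items []string) []string {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		seen []string
	)
	for _, u := range items {
		wg.Add(1)
		go func(item string) {
			defer wg.Done()
			mu.Lock()
			seen = append(seen, item) // use the parameter, not the loop variable u
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	sort.Strings(seen) // goroutine scheduling order is not deterministic
	return seen
}

func main() {
	fmt.Println(visitAll([]string{"https://golang.org/pkg/", "https://golang.org/cmd/"}))
}
```

The mutex is only there because the goroutines append to a shared slice. It is unrelated to the capture problem itself.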

I found this neat article that explains it in detail: https://eli.thegreenplace.net/2019/go-internals-capturing-loop-variables-in-closures/
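The capture bug also seems to explain the second question. If both goroutines see https://golang.org/cmd/, the first one marks it as fetched and fails with "not found", the second one returns early because the URL is already marked, and https://golang.org/pkg/ is never crawled. Below is a trimmed-down sketch of the crawler with the fix applied. linkMap, crawlAll, and site are hypothetical stand-ins for the question's fakeFetcher and main, introduced only to keep the example short.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Fetcher matches the interface from the question.
type Fetcher interface {
	Fetch(url string) (body string, urls []string, err error)
}

type fetchState struct {
	mu      sync.Mutex
	fetched map[string]bool
}

// Crawl is the question's function with one change: the goroutine uses its
// own parameter instead of the captured loop variable.
func Crawl(url string, depth int, fetcher Fetcher, f *fetchState) {
	f.mu.Lock()
	already := f.fetched[url]
	f.fetched[url] = true
	f.mu.Unlock()
	if already || depth <= 0 {
		return
	}
	body, urls, err := fetcher.Fetch(url)
	if err != nil {
		fmt.Println(err) // a failed fetch ends only this branch
		return
	}
	fmt.Printf("found: %s %q\n", url, body)

	var done sync.WaitGroup
	for _, u := range urls {
		done.Add(1)
		go func(link string) {
			defer done.Done()
			Crawl(link, depth-1, fetcher, f)
		}(u)
	}
	done.Wait()
}

// linkMap is a hypothetical, simplified stand-in for the question's
// fakeFetcher: it maps a URL to the links found on that page.
type linkMap map[string][]string

func (m linkMap) Fetch(url string) (string, []string, error) {
	links, ok := m[url]
	if !ok {
		return "", nil, fmt.Errorf("not found: %s", url)
	}
	return "page " + url, links, nil
}

// crawlAll runs Crawl from root and returns every URL that was marked, sorted.
func crawlAll(root string, depth int, fetcher Fetcher) []string {
	f := &fetchState{fetched: make(map[string]bool)}
	Crawl(root, depth, fetcher, f)
	var out []string
	for u := range f.fetched {
		out = append(out, u)
	}
	sort.Strings(out)
	return out
}

var site = linkMap{
	"https://golang.org/":         {"https://golang.org/pkg/", "https://golang.org/cmd/"},
	"https://golang.org/pkg/":     {"https://golang.org/", "https://golang.org/cmd/", "https://golang.org/pkg/fmt/", "https://golang.org/pkg/os/"},
	"https://golang.org/pkg/fmt/": {"https://golang.org/", "https://golang.org/pkg/"},
	"https://golang.org/pkg/os/":  {"https://golang.org/", "https://golang.org/pkg/"},
}

func main() {
	for _, u := range crawlAll("https://golang.org/", 4, site) {
		fmt.Println("visited:", u)
	}
}
```

With the parameter used inside the goroutine, the crawl should reach all five URLs in the sample data, and the missing cmd page only produces a single "not found" line.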

huangapple
  • Posted on August 26, 2022 at 05:47:39
  • Link to this article: https://go.coder-hub.com/73493998.html