Goroutine in for loop causes unexpected behavior

Question
I was doing the Web Crawler Exercise in A Tour of Go.
I tried to solve the exercise with a Mutex for concurrency, based on a solution found here. I modified it to fit the pre-defined signatures in the original question. However, the crawler stops at the second level of the URL tree. While debugging, the different behaviors of a print statement completely confused me:
var done sync.WaitGroup
for _, u := range urls {
	done.Add(1)
	fmt.Printf("enter: %s\n", u) // here
	go func(url string) {
		defer done.Done()
		Crawl(u, depth-1, fetcher, f)
	}(u)
}
done.Wait()
If I put the print statement outside the goroutine, the output is as expected, but I don't know why the crawl stops there:
enter: https://golang.org/pkg/
enter: https://golang.org/cmd/
But if I put the print statement inside the goroutine, that is
var done sync.WaitGroup
for _, u := range urls {
	done.Add(1)
	go func(url string) {
		defer done.Done()
		fmt.Printf("enter: %s\n", u) // here
		Crawl(u, depth-1, fetcher, f)
	}(u)
}
done.Wait()
The output becomes
enter: https://golang.org/cmd/
enter: https://golang.org/cmd/
I have two questions:
- In the second case, why does enter: https://golang.org/cmd/ get printed twice?
- Why does the Crawl function stop at an error instead of continuing to traverse the URL tree?
PS: the second question might be related to the first one. I intentionally used u instead of url inside the goroutine to reproduce the bug that confused me.
Below is my modified solution
package main

import (
	"fmt"
	"sync"
)

type Fetcher interface {
	// Fetch returns the body of URL and
	// a slice of URLs found on that page.
	Fetch(url string) (body string, urls []string, err error)
}

type fetchState struct {
	mu      sync.Mutex
	fetched map[string]bool
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, f *fetchState) {
	f.mu.Lock()
	already := f.fetched[url]
	f.fetched[url] = true
	f.mu.Unlock()
	if already {
		return
	}
	if depth <= 0 {
		return
	}
	body, urls, err := fetcher.Fetch(url)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("found: %s %q\n", url, body)
	var done sync.WaitGroup
	for _, u := range urls {
		done.Add(1)
		go func(url string) {
			defer done.Done()
			fmt.Printf("enter: %s\n", u)
			Crawl(u, depth-1, fetcher, f)
		}(u)
	}
	done.Wait()
	return
}

func makeState() *fetchState {
	f := &fetchState{}
	f.fetched = make(map[string]bool)
	return f
}

func main() {
	Crawl("https://golang.org/", 4, fetcher, makeState())
}

// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
	body string
	urls []string
}

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
	if res, ok := f[url]; ok {
		return res.body, res.urls, nil
	}
	return "", nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
	"https://golang.org/": &fakeResult{
		"The Go Programming Language",
		[]string{
			"https://golang.org/pkg/",
			"https://golang.org/cmd/",
		},
	},
	"https://golang.org/pkg/": &fakeResult{
		"Packages",
		[]string{
			"https://golang.org/",
			"https://golang.org/cmd/",
			"https://golang.org/pkg/fmt/",
			"https://golang.org/pkg/os/",
		},
	},
	"https://golang.org/pkg/fmt/": &fakeResult{
		"Package fmt",
		[]string{
			"https://golang.org/",
			"https://golang.org/pkg/",
		},
	},
	"https://golang.org/pkg/os/": &fakeResult{
		"Package os",
		[]string{
			"https://golang.org/",
			"https://golang.org/pkg/",
		},
	},
}
Answer 1
Score: 2
Welcome to Stack Overflow!
In your function, you defined url as the parameter but kept using u inside of it: the loop variable u is captured by the function literal.
Try doing this:
var done sync.WaitGroup
for _, u := range urls {
	done.Add(1)
	go func(url string) {
		defer done.Done()
		fmt.Printf("enter: %s\n", url) // <- check the difference
		Crawl(url, depth-1, fetcher, f) // <- check the difference
	}(u)
}
done.Wait()
As for why the same value gets printed twice via the u variable, this is a very common mistake: https://github.com/golang/go/wiki/CommonMistakes#using-goroutines-on-loop-iterator-variables
In short, all the goroutines share the single loop variable by reference. By the time each one executes, it will most likely see the last value of the iteration.
I found this neat article that explains it in detail: https://eli.thegreenplace.net/2019/go-internals-capturing-loop-variables-in-closures/