Exercise: Web Crawler - concurrency not working
Question

I am going through the golang tour and working on the final exercise: changing a web crawler to crawl in parallel without repeating a crawl (http://tour.golang.org/#73). All I have changed is the Crawl function:
// used records which URLs have already been crawled.
var used = make(map[string]bool)

func Crawl(url string, depth int, fetcher Fetcher) {
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("\nfound: %s %q\n\n", url, body)
    for _, u := range urls {
        if used[u] == false {
            used[u] = true
            go Crawl(u, depth-1, fetcher)
        }
    }
    return
}
To make it concurrent I added the go keyword in front of the recursive call to Crawl, but instead of recursively crawling, the program only finds the "http://golang.org/" page and no other pages.

Why doesn't the program work when I add the go keyword to the call to Crawl?

Answer 1

Score: 9

The problem seems to be that your process is exiting before all URLs can be followed by the crawler. Because of the concurrency, the `main()` procedure exits before the workers are finished.

To circumvent this, you could use `sync.WaitGroup`:
func Crawl(url string, depth int, fetcher Fetcher, wg *sync.WaitGroup) {
    defer wg.Done()
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("\nfound: %s %q\n\n", url, body)
    for _, u := range urls {
        if used[u] == false {
            used[u] = true
            wg.Add(1)
            go Crawl(u, depth-1, fetcher, wg)
        }
    }
    return
}
And call `Crawl` in `main` as follows:
func main() {
    wg := &sync.WaitGroup{}
    wg.Add(1) // account for the initial Crawl call, whose deferred wg.Done() would otherwise drive the counter negative
    Crawl("http://golang.org/", 4, fetcher, wg)
    wg.Wait()
}

Also, [don't rely on the map being thread safe](http://golang.org/doc/go_faq.html#atomic_maps).
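For example, a minimal sketch of what guarding the shared map could look like (my addition, not part of the original answer; markUsed is a hypothetical helper name) makes the check-and-set atomic under a `sync.Mutex`:

var (
    used = make(map[string]bool)
    mu   sync.Mutex
)

// markUsed reports whether url was already recorded, recording it
// otherwise, so concurrent goroutines never race on the map.
func markUsed(url string) bool {
    mu.Lock()
    defer mu.Unlock()
    if used[url] {
        return true
    }
    used[url] = true
    return false
}

The crawl loop would then spawn a goroutine only for fresh URLs: `if !markUsed(u) { wg.Add(1); go Crawl(u, depth-1, fetcher, wg) }`.
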
Answer 2

Score: 2

Here's an approach, again using [sync.WaitGroup](https://golang.org/pkg/sync/#WaitGroup) but wrapping the fetch in an anonymous goroutine. To make the url map thread safe (meaning parallel goroutines can't access and change values at the same time), wrap it in a new type that includes a [sync.Mutex](https://golang.org/pkg/sync/#Mutex), i.e. the `fetchedUrls` type in my example, and use the `Lock` and `Unlock` methods while the map is being searched/updated:
type fetchedUrls struct {
    urls map[string]bool
    mux  sync.Mutex
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
// used is passed as a pointer so that every call shares
// the same mutex and map rather than locking a copy.
func Crawl(url string, depth int, fetcher Fetcher, used *fetchedUrls, wg *sync.WaitGroup) {
    if depth <= 0 {
        return
    }
    used.mux.Lock()
    if used.urls[url] == false {
        used.urls[url] = true
        wg.Add(1)
        go func() {
            defer wg.Done()
            body, urls, err := fetcher.Fetch(url)
            if err != nil {
                fmt.Println(err)
                return
            }
            fmt.Printf("found: %s %q\n", url, body)
            for _, u := range urls {
                Crawl(u, depth-1, fetcher, used, wg)
            }
        }()
    }
    used.mux.Unlock()
}

func main() {
    wg := &sync.WaitGroup{}
    used := &fetchedUrls{urls: make(map[string]bool)}
    Crawl("https://golang.org/", 4, fetcher, used, wg)
    wg.Wait()
}

Output:
found: https://golang.org/ "The Go Programming Language"
not found: https://golang.org/cmd/
found: https://golang.org/pkg/ "Packages"
found: https://golang.org/pkg/os/ "Package os"
found: https://golang.org/pkg/fmt/ "Package fmt"
Program exited.
Answer 3

Score: 0

I created my two implementations (different concurrency designs) of the same exercise here. It also uses a thread-safe map.
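
For reference, one common shape for a channel-based design (a sketch of my own against the tour's Fetcher interface, not necessarily what the linked implementations do; CrawlChan is a hypothetical name) confines the seen map to a single coordinating goroutine, so it needs no lock at all:

func CrawlChan(url string, depth int, fetcher Fetcher) {
    if depth <= 0 {
        return
    }
    type result struct {
        urls  []string
        depth int
    }
    results := make(chan result)
    // fetch runs in its own goroutine and reports back over the channel.
    fetch := func(url string, depth int) {
        body, urls, err := fetcher.Fetch(url)
        if err != nil {
            fmt.Println(err)
            results <- result{nil, depth}
            return
        }
        fmt.Printf("found: %s %q\n", url, body)
        results <- result{urls, depth}
    }

    // Only this goroutine touches seen and pending, so no mutex is needed.
    seen := map[string]bool{url: true}
    pending := 1
    go fetch(url, depth)
    for pending > 0 {
        r := <-results
        pending--
        if r.depth <= 1 {
            continue // children would be fetched at depth 0
        }
        for _, u := range r.urls {
            if !seen[u] {
                seen[u] = true
                pending++
                go fetch(u, r.depth-1)
            }
        }
    }
}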