Exercise: Web Crawler – concurrency not working


Question

I am going through the Go tour and working on the final exercise, changing a web crawler to crawl in parallel and not repeat a crawl (http://tour.golang.org/#73). All I have changed is the Crawl function.

    var used = make(map[string]bool)

    func Crawl(url string, depth int, fetcher Fetcher) {
        if depth <= 0 {
            return
        }
        body, urls, err := fetcher.Fetch(url)
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Printf("\nfound: %s %q\n\n", url, body)
        for _, u := range urls {
            if !used[u] {
                used[u] = true
                go Crawl(u, depth-1, fetcher)
            }
        }
    }

To make it concurrent I added the go keyword in front of the recursive call to Crawl, but instead of crawling recursively the program only finds the "http://golang.org/" page and no other pages.

Why doesn't the program work when I add the go keyword to the call to Crawl?


Answer 1

Score: 9

The problem seems to be that your process exits before the crawler can follow all of the URLs: because of the concurrency, main() returns before the worker goroutines are finished.

To circumvent this, you can use sync.WaitGroup:

func Crawl(url string, depth int, fetcher Fetcher, wg *sync.WaitGroup) {
    defer wg.Done()
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("\nfound: %s %q\n\n", url, body)
    for _, u := range urls {
        if !used[u] {
            used[u] = true
            wg.Add(1)
            go Crawl(u, depth-1, fetcher, wg)
        }
    }
}

And call `Crawl` in `main` as follows:


func main() {
    wg := &sync.WaitGroup{}

    wg.Add(1) // account for the initial Crawl call, which defers wg.Done()
    Crawl("http://golang.org/", 4, fetcher, wg)

    wg.Wait()
}

Also, [don't rely on the map being thread safe](http://golang.org/doc/go_faq.html#atomic_maps).

Answer 2

Score: 2

Here's an approach, again using sync.WaitGroup, but wrapping the fetch in an anonymous goroutine. To make the url map thread safe (meaning parallel goroutines can't access and change values at the same time), wrap it in a new type that includes a sync.Mutex, i.e. the fetchedUrls type in this example, and use the Lock and Unlock methods while the map is being searched or updated.

type fetchedUrls struct {
	urls map[string]bool
	mux  sync.Mutex
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
// used is passed as a pointer so that every goroutine shares
// the same map and the same mutex (a copied mutex locks nothing).
func Crawl(url string, depth int, fetcher Fetcher, used *fetchedUrls, wg *sync.WaitGroup) {
	if depth <= 0 {
		return
	}
	used.mux.Lock()
	if !used.urls[url] {
		used.urls[url] = true
		wg.Add(1)
		go func() {
			defer wg.Done()
			body, urls, err := fetcher.Fetch(url)
			if err != nil {
				fmt.Println(err)
				return
			}
			fmt.Printf("found: %s %q\n", url, body)
			for _, u := range urls {
				Crawl(u, depth-1, fetcher, used, wg)
			}
		}()
	}
	used.mux.Unlock()
}

func main() {
	wg := &sync.WaitGroup{}
	used := &fetchedUrls{urls: make(map[string]bool)}
	Crawl("https://golang.org/", 4, fetcher, used, wg)
	wg.Wait()
}

Output:

found: https://golang.org/ "The Go Programming Language"
not found: https://golang.org/cmd/
found: https://golang.org/pkg/ "Packages"
found: https://golang.org/pkg/os/ "Package os"
found: https://golang.org/pkg/fmt/ "Package fmt"

Program exited.


Answer 3

Score: 0

I created my two implementations (different concurrency designs) of the same exercise here.

It also uses a thread-safe map.

playground link


huangapple • Published 2012-09-01 12:46:11 • Please keep this link when reposting: https://go.coder-hub.com/12224962.html