Exercise: Web Crawler – concurrency not working


Question

I am going through the Go tour and working on the final exercise, changing a web crawler to crawl in parallel and not repeat a crawl (http://tour.golang.org/#73). All I have changed is the Crawl function.

    var used = make(map[string]bool)

    func Crawl(url string, depth int, fetcher Fetcher) {
        if depth <= 0 {
            return
        }
        body, urls, err := fetcher.Fetch(url)
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Printf("\nfound: %s %q\n\n", url, body)
        for _, u := range urls {
            if !used[u] {
                used[u] = true
                go Crawl(u, depth-1, fetcher)
            }
        }
    }

To make it concurrent I added the go keyword in front of the recursive call to Crawl, but instead of crawling recursively the program only finds the "http://golang.org/" page and no other pages.

Why doesn't the program work when I add the go keyword to the call to Crawl?


Answer 1

Score: 9

The problem seems to be that your process exits before the crawler can follow all of the URLs: because of the concurrency, main() returns before the worker goroutines are finished.

To circumvent this, you can use sync.WaitGroup:

func Crawl(url string, depth int, fetcher Fetcher, wg *sync.WaitGroup) {
    defer wg.Done()
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("\nfound: %s %q\n\n", url, body)
    for _, u := range urls {
        if !used[u] {
            used[u] = true
            wg.Add(1)
            go Crawl(u, depth-1, fetcher, wg)
        }
    }
}

And call `Crawl` in `main` as follows:


func main() {
    wg := &sync.WaitGroup{}

    wg.Add(1) // account for the initial Crawl call, which defers wg.Done()
    Crawl("http://golang.org/", 4, fetcher, wg)

    wg.Wait()
}

Also, [don't rely on the map being thread safe](http://golang.org/doc/go_faq.html#atomic_maps).

Answer 2

Score: 2

Here's an approach, again using sync.WaitGroup, but wrapping the fetch in an anonymous goroutine. To make the url map thread safe (meaning parallel goroutines can't access and change values at the same time), wrap it in a new type that includes a sync.Mutex, i.e. the fetchedUrls type in this example, and use the Lock and Unlock methods while the map is being searched or updated.

type fetchedUrls struct {
	urls map[string]bool
	mux  sync.Mutex
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
// used is passed as a pointer so that every goroutine shares
// the same map and the same mutex (a copied mutex locks nothing).
func Crawl(url string, depth int, fetcher Fetcher, used *fetchedUrls, wg *sync.WaitGroup) {
	if depth <= 0 {
		return
	}
	used.mux.Lock()
	if !used.urls[url] {
		used.urls[url] = true
		wg.Add(1)
		go func() {
			defer wg.Done()
			body, urls, err := fetcher.Fetch(url)
			if err != nil {
				fmt.Println(err)
				return
			}
			fmt.Printf("found: %s %q\n", url, body)
			for _, u := range urls {
				Crawl(u, depth-1, fetcher, used, wg)
			}
		}()
	}
	used.mux.Unlock()
}

func main() {
	wg := &sync.WaitGroup{}
	used := &fetchedUrls{urls: make(map[string]bool)}
	Crawl("https://golang.org/", 4, fetcher, used, wg)
	wg.Wait()
}

Output:

found: https://golang.org/ "The Go Programming Language"
not found: https://golang.org/cmd/
found: https://golang.org/pkg/ "Packages"
found: https://golang.org/pkg/os/ "Package os"
found: https://golang.org/pkg/fmt/ "Package fmt"

Program exited.


Answer 3

Score: 0

I created my two implementations (different concurrency designs) of the same exercise here.

It also uses a thread-safe map.

playground link


huangapple • Published 2012-09-01 12:46:11 • Please keep this link when reposting: https://go.coder-hub.com/12224962.html