Simple solution for golang tour webcrawler exercise

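Question

I'm new to Go. I have seen some solutions for this exercise, but they look complex to me.

My solution seems simple, but it fails with a deadlock error. I can't figure out how to properly close the channels and stop the loop inside main. Is there a simple way to do this?

Solution on Golang playground

Thanks to anyone who can help!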

package main

import (
	"fmt"
	"sync"
)

type Fetcher interface {
	// Fetch returns the body of URL and
	// a slice of URLs found on that page.
	Fetch(url string) (body string, urls []string, err error)
}

type SafeCache struct {
	cache map[string]bool
	mux   sync.Mutex
}

func (c *SafeCache) Set(s string) {
	c.mux.Lock()
	c.cache[s] = true
	c.mux.Unlock()
}

func (c *SafeCache) Get(s string) bool {
	c.mux.Lock()
	defer c.mux.Unlock()
	return c.cache[s]
}

var (
	sc    = SafeCache{cache: make(map[string]bool)}
	errs  = make(chan error)
	ress  = make(chan string)
)

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
	if depth <= 0 {
		return
	}

	var (
		body string
		err  error
		urls []string
	)

	if ok := sc.Get(url); !ok {
		sc.Set(url)
		body, urls, err = fetcher.Fetch(url)
	} else {
		err = fmt.Errorf("Already fetched: %s", url)
	}

	if err != nil {
		errs <- err
		return
	}

	ress <- fmt.Sprintf("found: %s %q\n", url, body)
	for _, u := range urls {
		go Crawl(u, depth-1, fetcher)
	}
	return
}

func main() {
	go Crawl("http://golang.org/", 4, fetcher)
	for {
		select {
		case res, ok := <-ress:
			fmt.Println(res)
			if !ok {
				break
			}
		case err, ok := <-errs:
			fmt.Println(err)
			if !ok {
				break
			}
		}
	}
}

// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
	body string
	urls []string
}

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
	if res, ok := f[url]; ok {
		return res.body, res.urls, nil
	}
	return "", nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
	"http://golang.org/": &fakeResult{
		"The Go Programming Language",
		[]string{
			"http://golang.org/pkg/",
			"http://golang.org/cmd/",
		},
	},
	"http://golang.org/pkg/": &fakeResult{
		"Packages",
		[]string{
			"http://golang.org/",
			"http://golang.org/cmd/",
			"http://golang.org/pkg/fmt/",
			"http://golang.org/pkg/os/",
		},
	},
	"http://golang.org/pkg/fmt/": &fakeResult{
		"Package fmt",
		[]string{
			"http://golang.org/",
			"http://golang.org/pkg/",
		},
	},
	"http://golang.org/pkg/os/": &fakeResult{
		"Package os",
		[]string{
			"http://golang.org/",
			"http://golang.org/pkg/",
		},
	},
}

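Answer 1

Score: 2

You can solve this with sync.WaitGroup.

  1. You can start listening on your channels in separate goroutines.
  2. The WaitGroup keeps track of how many goroutines are still running.

wg.Add(1) says that we are about to start a new goroutine.

wg.Done() says that a goroutine has finished.

wg.Wait() blocks the calling goroutine until every started goroutine has finished.

Together, these three methods let you coordinate goroutines.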

Go playground link

PS. You might be interested in sync.RWMutex for your SafeCache.

huangapple
  • Posted on March 9, 2017 at 16:22:33
  • When reposting, please keep the link to this article: https://go.coder-hub.com/42690138.html