Go web crawler gets stuck

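Question

I'm new to Go and trying to implement a web crawler. It should asynchronously parse web pages and save their contents to files, one file per new page. But it gets stuck after I added the following code: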

    u, _ := url.Parse(uri)
    fileName := u.Host + u.RawQuery + ".html"
    body, err := ioutil.ReadAll(resp.Body)
    writes <- writer{fileName: fileName, body: body}

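Can anyone help me fix this problem? Basically, I want to get the data from the response body, push it to a channel, and then receive it from the channel and write it to a file.
My guess is that the `writes` channel was not initialized, since sending on a nil channel blocks forever. Here is the full program: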

package main

import (
	"crypto/tls"
	"flag"
	"fmt"
	"io/ioutil"
	"net/http"
	"net/url"
	"os"
	"runtime"

	"./linksCollector"
)

type writer struct {
	fileName string
	body     []byte
}

var writes = make(chan writer)

func usage() {
	fmt.Fprintf(os.Stderr, "usage: crawl http://example.com/")
	flag.PrintDefaults()
	os.Exit(2)
}

func check(e error) {
	if e != nil {
		panic(e)
	}
}

func main() {
	runtime.GOMAXPROCS(8)
	flag.Usage = usage
	flag.Parse()

	args := flag.Args()
	fmt.Println(args)
	if len(args) < 1 {
		usage()
		fmt.Println("Please specify start page")
		os.Exit(1)
	}

	queue := make(chan string)
	filteredQueue := make(chan string)

	go func() { queue <- args[0] }()
	go filterQueue(queue, filteredQueue)

	for uri := range filteredQueue {
		go enqueue(uri, queue)
	}

	for {
		select {
		case data := <-writes:
			f, err := os.Create(data.fileName)
			check(err)
			defer f.Close()
			_, err = f.Write(data.body)
			check(err)
		}
	}
}

func filterQueue(in chan string, out chan string) {
	var seen = make(map[string]bool)
	for val := range in {
		if !seen[val] {
			seen[val] = true
			out <- val
		}
	}
}

func enqueue(uri string, queue chan string) {
	fmt.Println("fetching", uri)
	transport := &http.Transport{
		TLSClientConfig: &tls.Config{
			InsecureSkipVerify: true,
		},
	}
	client := http.Client{Transport: transport}
	resp, err := client.Get(uri)
	check(err)

	defer resp.Body.Close()

	u, _ := url.Parse(uri)
	fileName := u.Host + u.RawQuery + ".html"
	body, err := ioutil.ReadAll(resp.Body)
	writes <- writer{fileName: fileName, body: body}

	links := collectlinks.All(resp.Body)

	for _, link := range links {
		absolute := fixURL(link, uri)
		if uri != "" {
			go func() { queue <- absolute }()
		}
	}
}

func fixURL(href, base string) string {
	uri, err := url.Parse(href)
	if err != nil {
		return ""
	}
	baseURL, err := url.Parse(base)
	if err != nil {
		return ""
	}
	uri = baseURL.ResolveReference(uri)
	return uri.String()
}


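Answer 1

Score: 1

Original answer (retracted): Your for loop ends up calling go enqueue more than once before the select receives the data, causing the send on writes to crash the program, I think. I'm not really that familiar with Go's concurrency.

Update: I'm sorry for the previous answer; it was a poorly informed attempt at explaining something I have only limited knowledge of. After taking a closer look, I am almost certain of two things. **1.** Your writes channel is not nil; you can rely on make to initialize your channels. **2.** A range loop over a channel blocks until that channel is closed. So your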

    for uri := range filteredQueue {
        go enqueue(uri, queue)
    }

is blocking, so your program never reaches the select and therefore never receives from the writes channel. You can avoid this by running the range loop in a new goroutine.

    go func() {
        for uri := range filteredQueue {
            go enqueue(uri, queue)
        }
    }()
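To see the point about range in isolation, here is a minimal sketch that is separate from the crawler. The function name collect and the channels queue and results are hypothetical, chosen only for this illustration. It assumes a single producer that closes its channel when it is done, which the crawler in the question never does.

```go
package main

import "fmt"

// collect pushes inputs through a range loop that runs in its own goroutine,
// so the caller stays free to receive results while that loop is running.
func collect(inputs []string) []string {
	queue := make(chan string)   // make returns a usable, non-nil channel
	results := make(chan string) // unbuffered: a send waits for a receiver

	// Producer: sends every input, then closes queue so the range loop can end.
	go func() {
		for _, in := range inputs {
			queue <- in
		}
		close(queue)
	}()

	// The range loop blocks until queue is closed. Running it in a goroutine
	// keeps that blocking away from the caller.
	go func() {
		for item := range queue {
			results <- "seen " + item
		}
		close(results)
	}()

	// The caller reaches this receive loop immediately.
	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	fmt.Println(collect([]string{"a", "b"})) // prints [seen a seen b]
}
```

If the second go func() wrapper were removed, the range over queue would run in the caller, its first send on results would find no receiver, and the program would stall, which mirrors what happens in the question.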

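Your program, as is, will still break for other reasons, but you should be able to fix that with a little synchronization using a sync.WaitGroup. Here's a simplified example: https://play.golang.org/p/o2Oj4g8c2y.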
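The following is a rough sketch of how a sync.WaitGroup could tie the pieces together. It is not the code behind the playground link. To stay self-contained it replaces the HTTP fetch and link extraction with a hypothetical in-memory map (fakeWeb), and it swaps the filterQueue goroutine for a mutex-guarded seen map. The names page, fakeWeb, crawl, and visit are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// page plays the role of the writer struct from the question.
type page struct {
	fileName string
	body     []byte
}

// fakeWeb is a hypothetical link graph standing in for real HTTP requests.
var fakeWeb = map[string][]string{
	"a": {"b", "c"},
	"b": {"a", "c"},
	"c": {},
}

// crawl visits every page reachable from start and returns the sorted file
// names it would have written.
func crawl(start string) []string {
	writes := make(chan page)
	var wg sync.WaitGroup
	var mu sync.Mutex // guards seen
	seen := map[string]bool{start: true}

	var visit func(uri string)
	visit = func(uri string) {
		defer wg.Done()
		writes <- page{fileName: uri + ".html", body: []byte("content of " + uri)}
		for _, link := range fakeWeb[uri] {
			mu.Lock()
			isNew := !seen[link]
			seen[link] = true
			mu.Unlock()
			if isNew {
				wg.Add(1) // count the page before its goroutine starts
				go visit(link)
			}
		}
	}

	wg.Add(1)
	go visit(start)

	// Close writes once every visit has returned, so the loop below can end.
	go func() {
		wg.Wait()
		close(writes)
	}()

	var files []string
	for p := range writes { // a real crawler would write p.body to a file here
		files = append(files, p.fileName)
	}
	sort.Strings(files)
	return files
}

func main() {
	fmt.Println(crawl("a")) // prints [a.html b.html c.html]
}
```

The design choice here is that each visit calls wg.Add before spawning a child and wg.Done only when it returns, so the counter cannot reach zero while work is still pending. That gives the receiving loop a well-defined point at which the channel is closed and the program can exit.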


huangapple
  • Posted on March 29, 2017, 18:29:34
  • When reposting, please keep the link to this post: https://go.coder-hub.com/43091005.html