Web crawler stops at first page

Question

I'm working on a web crawler which should work like this:

  1. Go to a website and crawl all the links from the site.
  2. Download all the images (starting from the start page).
  3. If there are no images left on the current page, go to the next link found in step 1 and repeat steps 2 and 3 until there are no links/images left.

The code below seems to work to some extent: when I try to crawl some sites, I get some images to download.

(Although I don't understand the images I get, because I can't find them on the website; it seems the crawler does not start with the start page of the website.)

After a few images (~25–500), the crawler finishes and stops, with no errors; it just stops. I tried this with multiple websites, and after a few images it just stops. I think the crawler somehow ignores step 3.

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"strconv"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

var (
	currWebsite  string = "https://www.youtube.com"
	imageCount   int    = 0
	crawlWebsite string
)

func processElement(index int, element *goquery.Selection) {
	href, exists := element.Attr("href")
	if exists && strings.HasPrefix(href, "http") {
		crawlWebsite = href
		response, err := http.Get(crawlWebsite)
		if err != nil {
			log.Fatalf("error on current website")
		}

		defer response.Body.Close()

		document, err := goquery.NewDocumentFromReader(response.Body)
		if err != nil {
			log.Fatal("Error loading HTTP response body.", err)
		}

		document.Find("img").Each(func(index int, element *goquery.Selection) {
			imgSrc, exists := element.Attr("src")
			if strings.HasPrefix(imgSrc, "http") && exists {
				fileName := fmt.Sprintf("./images/img" + strconv.Itoa(imageCount) + ".jpg")
				currWebsite := fmt.Sprint(imgSrc)
				fmt.Println("[+]", currWebsite)
				DownloadFile(fileName, currWebsite)
				imageCount++
			}
		})
	}
}

func main() {
	err := os.MkdirAll("./images/", 0777)
	if err != nil {
		log.Fatalln("error on creating directory")
	}

	response, err := http.Get(currWebsite)
	if err != nil {
		log.Fatalln("error on searching website")
	}

	defer response.Body.Close()

	document, err := goquery.NewDocumentFromReader(response.Body)
	if err != nil {
		log.Fatalln("Error loading HTTP response body. ", err)
	}

	document.Find("a").Each(processElement)
}

func DownloadFile(filepath string, url string) {
	response, err := http.Get(url)
	if err != nil {
		log.Fatalln("error getting the website infos")
	}
	defer response.Body.Close()

	if response.StatusCode != 200 {
		log.Fatalln("received non 200 response code")
	}

	file, err := os.Create(filepath)
	if err != nil {
		log.Fatalf("error creating file at %v\n", filepath)
	}

	defer file.Close()

	_, err = io.Copy(file, response.Body)
	if err != nil {
		log.Fatalln("error copy file from src to dst")
	}
}

Answer 1

Score: 1

> (Although I don't understand the images I get, because I can't find them on the website; it seems the crawler does not start with the start page of the website.)

Yes, you are right. Your code will not download images from the start page, because the only thing it fetches from the start page is the set of anchor tag elements; it then calls processElement() for each anchor element found on the start page:

response, err := http.Get(currWebsite)
if err != nil {
    log.Fatalln("error on searching website")
}
defer response.Body.Close()

document, err := goquery.NewDocumentFromReader(response.Body)
if err != nil {
    log.Fatalln("Error loading HTTP response body. ", err)
}
document.Find("a").Each(processElement) // Here

To download all images from the start page, you should define another function, processUrl(), to do the work of fetching the img elements and downloading the images; in the processElement() function you then just need to get the href link and invoke processUrl() on that link:

func processElement(index int, element *goquery.Selection) {
    href, exists := element.Attr("href")
    if exists && strings.HasPrefix(href, "http") {
        crawlWebsite = href
        processUrl(crawlWebsite)
    }
}

func processUrl(crawlWebsite string) {
    response, err := http.Get(crawlWebsite)
    if err != nil {
        log.Fatalf("error on current website")
    }

    defer response.Body.Close()

    document, err := goquery.NewDocumentFromReader(response.Body)
    if err != nil {
        log.Fatal("Error loading HTTP response body.", err)
    }

    document.Find("img").Each(func(index int, element *goquery.Selection) {
        imgSrc, exists := element.Attr("src")
        if strings.HasPrefix(imgSrc, "http") && exists {
            fileName := fmt.Sprintf("./images/img" + strconv.Itoa(imageCount) + ".jpg")
            currWebsite := fmt.Sprint(imgSrc)
            fmt.Println("[+]", currWebsite)
            DownloadFile(fileName, currWebsite)
            imageCount++
        }
    })
}

Now just crawl the images from the start page before processing all the links:

func main() {
    ...
    document, err := goquery.NewDocumentFromReader(response.Body)
    if err != nil {
        log.Fatalln("Error loading HTTP response body. ", err)
    }
    // First crawl images from start page url
    processUrl(currWebsite)
    document.Find("a").Each(processElement)
}
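
One hedged side note, not part of the fix above: every helper in the program calls log.Fatal*, which exits the whole process, so a single failed request or non-200 image response aborts the entire crawl. Also, even with processUrl(), links are only followed one level deep (the anchors found on the start page), so the program simply ends once those pages have been processed, which may be part of why it seems to "just stop" after a bounded number of images. Below is a minimal sketch of an error-returning variant of DownloadFile; it assumes the same package and imports as the original program, and the lowercase name downloadFile is purely illustrative:

// downloadFile is a sketch of DownloadFile that reports failures to the caller
// instead of calling log.Fatal*, so one broken image URL no longer kills the crawl.
func downloadFile(filepath string, url string) error {
    response, err := http.Get(url)
    if err != nil {
        return fmt.Errorf("fetching %s: %w", url, err)
    }
    defer response.Body.Close()

    if response.StatusCode != http.StatusOK {
        return fmt.Errorf("fetching %s: unexpected status %d", url, response.StatusCode)
    }

    file, err := os.Create(filepath)
    if err != nil {
        return fmt.Errorf("creating %s: %w", filepath, err)
    }
    defer file.Close()

    if _, err := io.Copy(file, response.Body); err != nil {
        return fmt.Errorf("writing %s: %w", filepath, err)
    }
    return nil
}

Inside the Find("img").Each callback one could then log and skip a failed image instead of exiting:

if err := downloadFile(fileName, currWebsite); err != nil {
    log.Println("[-] skipping image:", err)
    return // move on to the next img element
}
imageCount++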
