Web crawler stops at first page


Question

I'm working on a web crawler that should work like this:

  1. Go to a website and crawl all links from that site.
  2. Download all images (starting from the start page).
  3. If there are no images left on the current page, go to the next link found in step 1 and repeat steps 2 and 3 until there are no links/images left.

The code below seems to work to some extent: when I try to crawl some sites, I do get some images to download.

(I don't understand where those images come from, though, because I can't find them on the website; it seems the crawler does not start at the website's start page.)

After a few images (~25-500), the crawler stops without any error; it just stops. I tried this with multiple websites, and after a few images it just stops. I think the crawler is somehow ignoring step 3.

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"strconv"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

var (
	currWebsite  string = "https://www.youtube.com"
	imageCount   int    = 0
	crawlWebsite string
)

func processElement(index int, element *goquery.Selection) {
	href, exists := element.Attr("href")
	if exists && strings.HasPrefix(href, "http") {
		crawlWebsite = href
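		// fetch the page behind this link so its images can be downloaded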
		response, err := http.Get(crawlWebsite)
		if err != nil {
			log.Fatalf("error on current website")
		}

		defer response.Body.Close()

		document, err := goquery.NewDocumentFromReader(response.Body)
		if err != nil {
			log.Fatal("Error loading HTTP response body.", err)
		}

		document.Find("img").Each(func(index int, element *goquery.Selection) {
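			// keep only images with an absolute src URL and download each one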
			imgSrc, exists := element.Attr("src")
			if strings.HasPrefix(imgSrc, "http") && exists {
				fileName := fmt.Sprintf("./images/img" + strconv.Itoa(imageCount) + ".jpg")
				currWebsite := fmt.Sprint(imgSrc)
				fmt.Println("[+]", currWebsite)
				DownloadFile(fileName, currWebsite)
				imageCount++
			}
		})
	}
}

func main() {
	err := os.MkdirAll("./images/", 0777)
	if err != nil {
		log.Fatalln("error on creating directory")
	}

	response, err := http.Get(currWebsite)
	if err != nil {
		log.Fatalln("error on searching website")
	}

	defer response.Body.Close()

	document, err := goquery.NewDocumentFromReader(response.Body)
	if err != nil {
		log.Fatalln("Error loading HTTP response body. ", err)
	}

	document.Find("a").Each(processElement)
}

func DownloadFile(filepath string, url string) {
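	// fetch the image and stream the response body into a local file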
	response, err := http.Get(url)
	if err != nil {
		log.Fatalln("error getting the website info")
	}
	defer response.Body.Close()

	if response.StatusCode != 200 {
		log.Fatalln("received non 200 response code")
	}

	file, err := os.Create(filepath)
	if err != nil {
		log.Fatalf("error creating file at %v\n", filepath)
	}

	defer file.Close()

	_, err = io.Copy(file, response.Body)
	if err != nil {
		log.Fatalln("error copying file from src to dst")
	}
}

Answer 1

Score: 1


> (I don't understand where those images come from, though, because I can't find them on the website; it seems the crawler does not start at the website's start page.)

Yes, you are right. Your code will not download images from the start page, because the only thing it fetches from the start page is the anchor tag elements; it then calls processElement() for each anchor element found on the start page:

response, err := http.Get(currWebsite)
if err != nil {
log.Fatalln("error on searching website")
}
defer response.Body.Close()
document, err := goquery.NewDocumentFromReader(response.Body)
if err != nil {
log.Fatalln("Error loading HTTP response body. ", err)
}
document.Find("a").Each(processElement) // Here

To download all images from the start page, you should define another function, processUrl(), that does the work of fetching the img elements and downloading the images. The processElement() function then only needs to get the href link and invoke processUrl() on that link:

func processElement(index int, element *goquery.Selection) {
    href, exists := element.Attr("href")
    if exists && strings.HasPrefix(href, "http") {
        crawlWebsite = href
        processUrl(crawlWebsite)
    }
}

func processUrl(crawlWebsite string) {
    response, err := http.Get(crawlWebsite)
    if err != nil {
        log.Fatalf("error on current website")
    }

    defer response.Body.Close()

    document, err := goquery.NewDocumentFromReader(response.Body)
    if err != nil {
        log.Fatal("Error loading HTTP response body.", err)
    }

    document.Find("img").Each(func(index int, element *goquery.Selection) {
        imgSrc, exists := element.Attr("src")
        if strings.HasPrefix(imgSrc, "http") && exists {
            fileName := fmt.Sprintf("./images/img" + strconv.Itoa(imageCount) + ".jpg")
            currWebsite := fmt.Sprint(imgSrc)
            fmt.Println("[+]", currWebsite)
            DownloadFile(fileName, currWebsite)
            imageCount++
        }
    })
}

Now just crawl the images from the start page before processing all the links:

func main() {
    ...
    document, err := goquery.NewDocumentFromReader(response.Body)
    if err != nil {
        log.Fatalln("Error loading HTTP response body. ", err)
    }
    // First crawl images from start page url
    processUrl(currWebsite)
    document.Find("a").Each(processElement)
}
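
The snippets above still only go one level deep: the start page plus each page directly linked from it. If you also want the crawler to keep following the links it finds on those pages (step 3 in your list), one option is a breadth-first loop with a visited set. Note as well that every error path currently calls log.Fatal, which exits the whole program, so a single failed request ends the crawl; logging the error and continuing keeps the crawler alive. The following is only a rough sketch along those lines, reusing your DownloadFile() helper and the global imageCount, with a hypothetical maxPages cap added so the crawl does not run forever:

    func crawl(start string, maxPages int) {
        visited := map[string]bool{start: true}
        queue := []string{start}

        for len(queue) > 0 && len(visited) <= maxPages {
            // take the next page URL off the queue
            url := queue[0]
            queue = queue[1:]

            resp, err := http.Get(url)
            if err != nil {
                log.Println("skipping", url, "-", err)
                continue
            }
            doc, err := goquery.NewDocumentFromReader(resp.Body)
            resp.Body.Close()
            if err != nil {
                log.Println("skipping", url, "-", err)
                continue
            }

            // download every image on this page (same logic as processUrl)
            doc.Find("img").Each(func(_ int, s *goquery.Selection) {
                if src, ok := s.Attr("src"); ok && strings.HasPrefix(src, "http") {
                    fileName := fmt.Sprintf("./images/img%d.jpg", imageCount)
                    fmt.Println("[+]", src)
                    DownloadFile(fileName, src)
                    imageCount++
                }
            })

            // queue every absolute link that has not been seen yet
            doc.Find("a").Each(func(_ int, s *goquery.Selection) {
                if href, ok := s.Attr("href"); ok && strings.HasPrefix(href, "http") && !visited[href] {
                    visited[href] = true
                    queue = append(queue, href)
                }
            })
        }
    }

To keep a single failed image download from stopping everything, DownloadFile() would also need to return an error instead of calling log.Fatal. Calling, say, crawl(currWebsite, 50) from main() in place of the document.Find("a").Each(processElement) line would then visit up to 50 pages, downloading the images from each one before moving on.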
