2023-04-11 16:29:05

Scrapy Crawl only first 5 pages of the site

Question

I am working on the following problem: my boss wants me to create a CrawlSpider in Scrapy that scrapes article details such as the title and description, and paginates through only the first 5 pages.

I created a CrawlSpider, but it paginates through all the pages. How can I restrict it to only the first (latest) 5 pages?

Listing page markup (the same page structure opens when we click the pagination "Next" link):

    <div class="list">
      <div class="snippet-content">
        <h2>
          <a href="https://example.com/article-1">Article 1</a>
        </h2>
      </div>
      <div class="snippet-content">
        <h2>
          <a href="https://example.com/article-2">Article 2</a>
        </h2>
      </div>
      <div class="snippet-content">
        <h2>
          <a href="https://example.com/article-3">Article 3</a>
        </h2>
      </div>
      <div class="snippet-content">
        <h2>
          <a href="https://example.com/article-4">Article 4</a>
        </h2>
      </div>
    </div>
    <ul class="pagination">
      <li class="next">
        <a href="https://www.example.com?page=2&keywords=&from=&topic=&year=&type="> Next </a>
      </li>
    </ul>

For this, I am using a Rule object with the restrict_xpaths argument to extract all the article links. For each followed link, the parse_item method runs and reads the article title and description from the meta tags.

Rule(LinkExtractor(restrict_xpaths='//div[contains(@class, "snippet-content")]/h2/a'), callback="parse_item",
     follow=True)

Detail page markup:

    <meta property="og:title" content="Article Title">
    <meta property="og:description" content="Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularized in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.">

After this, I added another Rule object to handle pagination. CrawlSpider uses the following link to open the next listing page and repeats the same procedure again and again.

Rule(LinkExtractor(restrict_xpaths='//ul[@class="pagination"]/li[@class="next"]/a'))

This is my CrawlSpider code:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import w3lib.html


class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/"]
    custom_settings = {
        'FEED_URI': 'articles.json',
        'FEED_FORMAT': 'json'
    }
    total = 0

    rules = (
        # Get the list of all articles on the one page and follow these links
        Rule(LinkExtractor(restrict_xpaths='//div[contains(@class, "snippet-content")]/h2/a'), callback="parse_item",
             follow=True),
        # After that get pagination next link get href and follow it, repeat the cycle
        Rule(LinkExtractor(restrict_xpaths='//ul[@class="pagination"]/li[@class="next"]/a'))
    )

    def parse_item(self, response):
        self.total = self.total + 1
        title = response.xpath('//meta[@property="og:title"]/@content').get() or ""
        description = w3lib.html.remove_tags(response.xpath('//meta[@property="og:description"]/@content').get()) or ""

        return {
            'id': self.total,
            'title': title,
            'description': description
        }

Is there a way we can restrict the crawler to crawl only the first 5 pages?

Answer 1

Score: 2

Solution 1: use the process_request argument of Rule.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


def limit_requests(request, response):
    # Option A: read the page number from the URL (works only if the URL
    # ends with a single-digit page number).
    # page_number = request.url[-1]
    # if int(page_number) >= 6:
    #     return None

    # Option B: count the pagination requests in a function attribute.
    if not hasattr(limit_requests, "page_number"):
        limit_requests.page_number = 0
    limit_requests.page_number += 1

    # The start page is page 1, so the 5th "next" request would open page 6.
    if limit_requests.page_number >= 5:
        return None  # returning None drops the request

    return request


class ExampleSpider(CrawlSpider):
    name = 'example_spider'

    start_urls = ['https://scrapingclub.com/exercise/list_basic/']
    rules = (
        # Get the list of all articles on the one page and follow these links
        Rule(LinkExtractor(restrict_xpaths='//div[@class="card-body"]/h4/a'), callback="parse_item",
             follow=True),
        # After that get pagination next link get href and follow it, repeat the cycle
        Rule(LinkExtractor(restrict_xpaths='//li[@class="page-item"][last()]/a'), process_request=limit_requests)
    )
    total = 0

    def parse_item(self, response):
        title = response.xpath('//h3//text()').get(default='')
        price = response.xpath('//div[@class="card-body"]/h4//text()').get(default='')
        self.total = self.total + 1

        return {
            'id': self.total,
            'title': title,
            'price': price
        }
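The commented-out check in limit_requests reads the page number from the last character of the URL. That only works when the URL ends with a single-digit page number; in the question's pagination link (`?page=2&keywords=&from=...`) the last character is `=`. Below is a minimal sketch of the same idea that parses the query string instead. It assumes the page number is carried in a query parameter named `page`, as in the question's markup. The names `page_number_from_url`, `limit_by_page_number` and `MAX_PAGE` are made up for this illustration and are not part of Scrapy.

```python
from types import SimpleNamespace
from urllib.parse import parse_qs, urlparse

MAX_PAGE = 5  # assumption: listing pages 1..5 should be crawled


def page_number_from_url(url, param="page"):
    """Return the integer value of the `page` query parameter, or None."""
    values = parse_qs(urlparse(url).query).get(param)
    if not values:
        return None
    try:
        return int(values[0])
    except ValueError:
        return None


def limit_by_page_number(request, response):
    """Drop pagination requests that point past MAX_PAGE; keep the rest."""
    page = page_number_from_url(request.url)
    if page is not None and page > MAX_PAGE:
        return None
    return request


# Stand-in objects with a `.url` attribute, used only to exercise the
# function outside of Scrapy.
next_to_2 = SimpleNamespace(url="https://www.example.com?page=2&keywords=&from=&topic=&year=&type=")
next_to_6 = SimpleNamespace(url="https://www.example.com?page=6&keywords=&from=&topic=&year=&type=")
print(limit_by_page_number(next_to_2, None) is next_to_2)  # True
print(limit_by_page_number(next_to_6, None))               # None
```

In the spider, this function would take the place of limit_requests in the pagination rule, i.e. `process_request=limit_by_page_number`. Unlike the counter, it keeps no state between requests.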

Solution 2: override the _requests_to_follow method (this should be slower, though). Note that it is a private CrawlSpider method, so it may change between Scrapy versions.

from scrapy.http import HtmlResponse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(CrawlSpider):
    name = 'example_spider'

    start_urls = ['https://scrapingclub.com/exercise/list_basic/']

    rules = (
        # Get the list of all articles on the one page and follow these links
        Rule(LinkExtractor(restrict_xpaths='//div[@class="card-body"]/h4/a'), callback="parse_item",
             follow=True),
        # After that get pagination next link get href and follow it, repeat the cycle
        Rule(LinkExtractor(restrict_xpaths='//li[@class="page-item"][last()]/a'))
    )
    total = 0
    page = 0

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        if self.page >= 5:  # stopping condition
            return
        seen = set()
        for rule_index, rule in enumerate(self._rules):
            links = [
                lnk
                for lnk in rule.link_extractor.extract_links(response)
                if lnk not in seen
            ]
            for link in rule.process_links(links):
                if rule_index == 1:  # assuming there's only one "next button"
                    self.page += 1
                seen.add(link)
                request = self._build_request(rule_index, link)
                yield rule.process_request(request, response)

    def parse_item(self, response):
        title = response.xpath('//h3//text()').get(default='')
        price = response.xpath('//div[@class="card-body"]/h4//text()').get(default='')
        self.total = self.total + 1

        return {
            'id': self.total,
            'title': title,
            'price': price
        }
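To see where the `self.page >= 5` stopping condition cuts the crawl off, the control flow can be traced without Scrapy. This is a simplified sketch, not Scrapy code: it assumes a single chain of listing pages, each with exactly one "next" link, and it ignores article responses (which are assumed to contain no pagination link). The function name `simulate` and its parameters are made up for this illustration.

```python
def simulate(total_pages, limit=5):
    """Trace which listing pages are fetched and which have links extracted."""
    page = 0      # mirrors ExampleSpider.page
    fetched = []  # listing pages that are downloaded
    extracted = []  # listing pages whose links are actually followed
    current = 1   # the start URL is listing page 1
    while current is not None:
        fetched.append(current)
        if page >= limit:  # stopping condition: nothing is extracted here
            break
        extracted.append(current)
        if current < total_pages:  # a "next" link exists on this page
            page += 1
            current += 1
        else:
            current = None
    return fetched, extracted


print(simulate(10))  # ([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5])
print(simulate(3))   # ([1, 2, 3], [1, 2, 3])
```

Under these assumptions, articles are collected from the first 5 listing pages. The 6th listing page is still downloaded, but nothing is extracted from it.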

The solutions are pretty much self-explanatory; if you want me to add something, please ask in the comments.
