March 7, 2023, 01:43:22

Scrapy - recursive function as callback for pagination

Question

I'm running into some difficulties with a Scrapy spider.

The parse() function is not working as it should. It receives the response for a URL containing a search keyword, and for each listing on the page it follows the listing's URL to fill a Scrapy Data item.

A second yield recursively calls parse with the next_page URL until max_pages is reached, so that the listings on the following pages are collected as well.

That second yield produces no output in output.json when running scrapy crawl example -o output.json.

Here is a reduced version of the spider that reproduces the problem when added to a Scrapy project.

import scrapy


class Data(scrapy.Item):
    page: int = scrapy.Field()
    url: str = scrapy.Field()
    description: str = scrapy.Field()
    user: str = scrapy.Field()
    images: list = scrapy.Field()


class Example(scrapy.Spider):
    name = 'example'
    search = '/search?category=&keyword='
    keywords = ['terrains', 'maison', 'land']
    max_pages = 2
    current_page = 1

    def gen_requests(self, url):
        for keyword in self.keywords:
            build_url = url + self.search
            kws = keyword.split(' ')
            if (len(kws) > 1):
                for (i, val) in enumerate(kws):
                    if (i == 0):
                        build_url += val
                    else:
                        build_url += f'+{val}'
            else:
                build_url += kws[0]
            yield scrapy.Request(build_url, meta={'main_url': url, 'current_page': 1}, callback=self.parse)

    def start_requests(self):
        urls = ['https://ci.coinafrique.com', 'https://sn.coinafrique.com', 'https://bj.coinafrique.com']
        for url in urls:
            for request in self.gen_requests(url):
                yield request

    def parse(self, response):
        current_page = response.meta['current_page']
        main_url = response.meta['main_url']
        for listing in response.css('div.col.s6.m4'):
            href = listing.xpath('.//p[@class="ad__card-description"]/a/@href').get()
            yield scrapy.Request(response.urljoin(href), meta={'current_page': current_page}, callback=self.followListing)
        try:
            next_page_url = response.css('li.pagination-indicator.direction a::attr(href)')[1].get()
            if next_page_url is not None and current_page < self.max_pages:
                next_page = main_url + '/search' + next_page_url
                current_page += 1
                yield scrapy.Request(next_page, meta={'main_url': main_url, 'current_page': 1}, callback=self.parse)
        except:
            print('No next page found')

    def followListing(self, response):
        url = response.url
        current_page = response.meta['current_page']
        description = response.xpath('//div[@class="ad__info__box ad__info__box-descriptions"]//text()').getall()[1]
        profile = response.css('div.profile-card__content')
        user = profile.xpath('.//p[@class="username"]//text()').get()
        images = []
        for image in response.xpath('//div[contains(@class,"slide-clickable")]/@style').re(r'url\((.*)\)'):
            images.append(image)
        yield Data(
            page=current_page,
            url=url,
            description=description,
            user=user,
            images=images
        )

If I swap the two yields in parse(), it returns only the listings for max_pages (e.g. page 2). In both cases it seems that only the results from the first yield are returned.

def parse(self, response):
    current_page = response.meta['current_page']
    main_url = response.meta['main_url']
    try:
        next_page_url = response.css('li.pagination-indicator.direction a::attr(href)')[1].get()
        if next_page_url is not None and current_page < self.max_pages:
            next_page = main_url + '/search' + next_page_url
            current_page += 1
            yield scrapy.Request(next_page, meta={'main_url': main_url, 'current_page': 1}, callback=self.parse)
    except:
        print('No next page found')
    for listing in response.css('div.col.s6.m4'):
        href = listing.xpath('.//p[@class="ad__card-description"]/a/@href').get()
        yield scrapy.Request(response.urljoin(href), meta={'current_page': current_page}, callback=self.followListing)

Answer 1

Score: 1

Instead of using the request's meta dictionary to pass variables between callbacks, Scrapy has the cb_kwargs parameter for exactly that purpose. However, in this instance neither is needed to handle the pagination itself.

The reason it's not working is that something about how you construct the URL for the next page is failing. So instead of relying on the main_url and current_page variables, you can read the current page from the pagination elements at the bottom of the page: look for the page link that has active as its class name, then take that element's following sibling to find the next page. The relative link can then be turned into an absolute one with response.urljoin.

For example:

    def parse(self, response):
        current_page = response.xpath('//li/span[@class="active"]')
        current_text = current_page.xpath('.//text()').get()
        for listing in response.css('div.col.s6.m4'):
            href = listing.xpath('.//p[@class="ad__card-description"]/a/@href').get()
            yield scrapy.Request(response.urljoin(href), callback=self.followListing, cb_kwargs={"current_page": current_text})
        next_page = current_page.xpath('./following-sibling::span/a/@href').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
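
To illustrate why resolving the link against the page URL is safer than concatenating strings, here is a minimal standard-library sketch. It uses urllib.parse.urljoin, which resolves relative links in much the same way as response.urljoin. The helper name resolve_next and both href values are assumptions made up for the example; the real hrefs on the site may look different.

```python
from urllib.parse import urljoin


def resolve_next(page_url: str, href: str) -> str:
    """Resolve a pagination href against the URL of the page it was found on."""
    return urljoin(page_url, href)


# Hypothetical values for illustration only; the site's real hrefs may differ.
PAGE_URL = "https://sn.coinafrique.com/search?category=&keyword=terrains"
ABSOLUTE_PATH_HREF = "/search?category=&keyword=terrains&page=2"
QUERY_ONLY_HREF = "?category=&keyword=terrains&page=2"

if __name__ == "__main__":
    # Both href shapes resolve to the same absolute URL.
    print(resolve_next(PAGE_URL, ABSOLUTE_PATH_HREF))
    print(resolve_next(PAGE_URL, QUERY_ONLY_HREF))
    # Manual concatenation, as in the question, repeats the path segment
    # when the href already starts with /search.
    print("https://sn.coinafrique.com" + "/search" + ABSOLUTE_PATH_HREF)
```

Whether the question's concatenation actually breaks depends on the shape of the href the site returns, which is why letting the URL resolver handle it is the more robust choice.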

The followListing callback then receives the current page as a keyword argument through cb_kwargs. Note that the Data item needs a current_page field for this to work.

    def followListing(self, response, current_page):
        description = response.xpath('//div[@class="ad__info__box ad__info__box-descriptions"]//text()').getall()
        description = description[1] if description else ""
        profile = response.css('div.profile-card__content')
        user = profile.xpath('.//p[@class="username"]//text()').get()
        images = []
        for image in response.xpath('//div[contains(@class,"slide-clickable")]/@style').re(r'url\((.*)\)'):
            images.append(image)
        yield Data(
            url=response.url,
            current_page=current_page,
            description=description,
            user=user,
            images=images
        )
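
One detail in the question's parse() is worth checking regardless of how the next URL is built: it increments current_page but then sends the literal 1 in meta, so each following response starts again at page 1. The plain-Python sketch below models only that bookkeeping. The function simulate_pages and its propagate flag are hypothetical names for illustration, and no Scrapy is involved.

```python
def simulate_pages(total_pages: int, max_pages: int, propagate: bool) -> list:
    """Return the page number each simulated parse() call would see.

    propagate=False mimics passing the literal 1 in meta, as in the question;
    propagate=True passes the incremented counter instead.
    """
    seen = []
    meta = {"current_page": 1}
    for _ in range(total_pages):  # one iteration per fetched results page
        current_page = meta["current_page"]
        seen.append(current_page)
        if current_page >= max_pages:
            break  # limit reached, no further page is requested
        current_page += 1
        meta = {"current_page": current_page if propagate else 1}
    return seen


if __name__ == "__main__":
    print(simulate_pages(5, 2, propagate=False))  # [1, 1, 1, 1, 1]
    print(simulate_pages(5, 2, propagate=True))   # [1, 2]
```

With the literal 1, every page reports page 1 and the max_pages limit never triggers. That could make the output look as though only one page was scraped even when later pages were fetched.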

Put together, the spider would look like this:

import scrapy


class Data(scrapy.Item):
    page: int = scrapy.Field()
    url: str = scrapy.Field()
    current_page = scrapy.Field()
    description: str = scrapy.Field()
    user: str = scrapy.Field()
    images: list = scrapy.Field()


class Example(scrapy.Spider):
    name = 'example'
    search = '/search?category=&keyword='
    keywords = ['terrains', 'maison', 'land']
    max_pages = 2
    current_page = 1

    def gen_requests(self, url):
        # Build one search request per keyword; multi-word keywords are joined with '+'.
        for keyword in self.keywords:
            build_url = url + self.search
            kws = keyword.split(' ')
            if (len(kws) > 1):
                for (i, val) in enumerate(kws):
                    if (i == 0):
                        build_url += val
                    else:
                        build_url += f'+{val}'
            else:
                build_url += kws[0]
            yield scrapy.Request(build_url, callback=self.parse)

    def start_requests(self):
        urls = ['https://ci.coinafrique.com', 'https://sn.coinafrique.com', 'https://bj.coinafrique.com']
        for url in urls:
            for request in self.gen_requests(url):
                yield request

    def parse(self, response):
        # The active pagination element carries the current page number.
        current_page = response.xpath('//li/span[@class="active"]')
        current_text = current_page.xpath('.//text()').get()
        for listing in response.css('div.col.s6.m4'):
            href = listing.xpath('.//p[@class="ad__card-description"]/a/@href').get()
            yield scrapy.Request(response.urljoin(href), callback=self.followListing, cb_kwargs={"current_page": current_text})
        # Follow the next page only while below max_pages; skip if no page label was found.
        if current_text and int(current_text) < self.max_pages:
            next_page = current_page.xpath('./following-sibling::span/a/@href').get()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def followListing(self, response, current_page):
        # current_page arrives through cb_kwargs from parse().
        description = response.xpath('//div[@class="ad__info__box ad__info__box-descriptions"]//text()').getall()
        description = description[1] if description else ""
        profile = response.css('div.profile-card__content')
        user = profile.xpath('.//p[@class="username"]//text()').get()
        images = []
        for image in response.xpath('//div[contains(@class,"slide-clickable")]/@style').re(r'url\((.*)\)'):
            images.append(image)
        yield Data(
            url=response.url,
            current_page=current_page,
            description=description,
            user=user,
            images=images
        )

Partial output from running the spider with:

scrapy crawl example -o results.json

{"url": "https://sn.coinafrique.com/annonce/terrains/vente-terrains-150-m2-mbao-4094405", "current_page": "4", "description": "Kalimo city situ\u00e9 \u00e0 30mm du centre-ville de dakar et \u00e0 10mn du lac rose plus pr\u00e9cis\u00e9ment \u00e0 ndiakhirate, proche de l'autoroute \u00e0 p\u00e9age a1 sortie 10 de diamniadio, aibd et du prolongement de la vdn. \ncette nouvelle cit\u00e9 disposant de toutes les commodit\u00e9s vous propose des parcelles de 150m\u00b2 en cours de viabilisation \u00e0 12 500 000 ht payables sur 2ans. \nmodalit\u00e9s de paiement : apport de r\u00e9servation 50% soit 6 250 000 + 200.000 pour les frais d\u2019ouverture de dossier et le reliquat \u00e9tal\u00e9 sur 2ans soit 260 416/ mois sans int\u00e9r\u00eat. \nnature juridique : titre foncier individuel", "user": "Fatou Thiam", "images": ["https://images.coinafrique.com/4094405_uploaded_image1_1676373698.jpg", "https://images.coinafrique.com/4094405_uploaded_image2_1676373698.jpeg", "https://images.coinafrique.com/4094405_uploaded_image1_1676373698.jpg", "https://images.coinafrique.com/4094405_uploaded_image2_1676373698.jpeg"]},
{"url": "https://sn.coinafrique.com/annonce/voitures/toyota-land-cruiser-2012-3898070", "current_page": "2", "description": "Toyota tr\u00e8s bien entretenu. moteur impeccable.", "user": "Mouhamed Seck", "images": ["https://images.coinafrique.com/3898070_uploaded_image1_1664451625.jpg", "https://images.coinafrique.com/3898070_uploaded_image1_1664451625.jpg"]},
{"url": "https://sn.coinafrique.com/annonce/voitures/toyota-land-cruiser-2016-3898332", "current_page": "2", "description": "Vente prado vxr 2016 full option 7 places en tres bon \u00e9tat", "user": "Arnaud Tavarez", "images": ["https://images.coinafrique.com/3898332_uploaded_image1_1664461271.jpg", "https://images.coinafrique.com/3898332_uploaded_image2_1664461271.jpeg", "https://images.coinafrique.com/3898332_uploaded_image3_1664461271.jpeg", "https://images.coinafrique.com/3898332_uploaded_image1_1664461271.jpg", "https://images.coinafrique.com/3898332_uploaded_image2_1664461271.jpeg", "https://images.coinafrique.com/3898332_uploaded_image3_1664461271.jpeg"]},
{"url": "https://sn.coinafrique.com/annonce/voitures/land-rover-range-rover-2014-3860928", "current_page": "2", "description": "Range rover sport hse\nPremi\u00e8re inscription09/2014\nPuissance215 kw (292 ch)\nType de carburant diesel\nTransmissionautomatique\nClasse d'\u00e9mission euro5\nClimatisation (climatisation\nAide au stationnement avant, arri\u00e8re\nVerrouillage centralis\u00e9 sans cl\u00e9\nDirection assist\u00e9e traction int\u00e9grale pneus tout temps pare-brise chauffant volant chauffant Bluetooth ordinateur de bord\nLecteur cd vitres \u00e9lectriques r\u00e9troviseur \u00e9lectrique r\u00e9glage de si\u00e8ge \u00e9lectrique antid\u00e9marrage Electrique\nVolant multifonctionnel\nSyst\u00e8me de navigation\nCommande vocale\nD\u00e9marrage/arr\u00eat automatique\nR\u00e9gulateur de vitesse\nEcran tactile\nPhares au x\u00e9non                                                               ", "user": "MANSA STORE", "images": ["https://images.coinafrique.com/3860928_uploaded_image1_1662209243.jpg", "https://images.coinafrique.com/3860928_uploaded_image2_1662209244.jpeg", "https://images.coinafrique.com/3860928_uploaded_image3_1662209244.jpeg", "https://images.coinafrique.com/3860928_uploaded_image4_1662209244.jpeg", "https://images.coinafrique.com/3860928_uploaded_image5_1662209244.jpeg", "https://images.coinafrique.com/3860928_uploaded_image6_1662209244.jpeg", "https://images.coinafrique.com/3860928_uploaded_image1_1662209243.jpg", "https://images.coinafrique.com/3860928_uploaded_image2_1662209244.jpeg", "https://images.coinafrique.com/3860928_uploaded_image3_1662209244.jpeg", "https://images.coinafrique.com/3860928_uploaded_image4_1662209244.jpeg", "https://images.coinafrique.com/3860928_uploaded_image5_1662209244.jpeg", "https://images.coinafrique.com/3860928_uploaded_image6_1662209244.jpeg"]},
{"url": "https://sn.coinafrique.com/annonce/voitures/toyota-land-cruiser-2018-3901898", "current_page": "2", "description": "Toyota prado land cruiser vx anne 2018 automatique diesel 5 palace full options grand \u00e9cran cam\u00e9ra de recul frigo bar \r\ndisponibles chez moi", "user": "Aly D\u00e9me", "images": ["https://images.coinafrique.com/3901898_uploaded_image1_1664670013.jpg", "https://images.coinafrique.com/3901898_uploaded_image2_1664669859.jpeg", "https://images.coinafrique.com/3901898_uploaded_image3_1664669859.jpeg", "https://images.coinafrique.com/3901898_uploaded_image1_1664670013.jpg", "https://images.coinafrique.com/3901898_uploaded_image2_1664669859.jpeg", "https://images.coinafrique.com/3901898_uploaded_image3_1664669859.jpeg"]},
{"url": "https://sn.coinafrique.com/annonce/terrains/terrain-700-m2-yoff-4055054", "current_page": "4", "description": "Terrain 700m2 pieds dans l\u2019eau virage   - yoff\na vendre au virage yoff,\nune parcelle pieds l\u2019eau, \npour les amoureux \nde brise de mer ( 700 m2 )\nprix: 630.000.000 fcfa ", "user": "OVHA GROUP", "images": ["https://images.coinafrique.com/4055054_uploaded_image1_1674042554.jpg", "https://images.coinafrique.com/4055054_uploaded_image1_1674042554.jpg"]},
{"url": "https://sn.coinafrique.com/annonce/voitures/land-rover-range-rover-vogue-2020-3889515", "current_page": "2", "description": "Prix d\u00e9douan\u00e9 \n\na vendre magnifique range rover vogue v6 diesel \n\nfull option \n\nv\u00e9hicule diplomatique, entretenu exclusivement chez range rover casablanca. \n\n2 cl\u00e9s / parfait \u00e9tat                           ", "user": "Auto Elegance", "images": ["https://images.coinafrique.com/3889515_uploaded_image1_1663920624.jpg", "https://images.coinafrique.com/3889515_uploaded_image2_1663920624.jpeg", "https://images.coinafrique.com/3889515_uploaded_image3_1663920624.jpeg", "https://images.coinafrique.com/3889515_uploaded_image4_1663920624.jpeg", "https://images.coinafrique.com/3889515_uploaded_image5_1663920624.jpeg", "https://images.coinafrique.com/3889515_uploaded_image6_1663920624.jpeg", "https://images.coinafrique.com/3889515_uploaded_image7_1663920624.jpeg", "https://images.coinafrique.com/3889515_uploaded_image8_1663920624.jpeg", "https://images.coinafrique.com/3889515_uploaded_image9_1663920624.jpeg", "https://images.coinafrique.com/3889515_uploaded_image1_1663920624.jpg", "https://images.coinafrique.com/3889515_uploaded_image2_1663920624.jpeg", "https://images.coinafrique.com/3889515_uploaded_image3_1663920624.jpeg", "https://images.coinafrique.com/3889515_uploaded_image4_1663920624.jpeg", "https://images.coinafrique.com/3889515_uploaded_image5_1663920624.jpeg", "https://images.coinafrique.com/3889515_uploaded_image6_1663920624.jpeg", "https://images.coinafrique.com/3889515_uploaded_image7_1663920624.jpeg", "https://images.coinafrique.com/3889515_uploaded_image8_1663920624.jpeg", "https://images.coinafrique.com/3889515_uploaded_image9_1663920624.jpeg"]}
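
The max_pages check in parse() converts the active pagination label with int(), which assumes the label is always present and numeric. A small helper can make that check tolerant of a missing or non-numeric label. This is only a sketch: should_follow is a hypothetical name, and treating an unreadable label as "do not follow" is an assumption about the desired behavior.

```python
from typing import Optional


def should_follow(page_label: Optional[str], max_pages: int) -> bool:
    """Return True if the crawl should request the page after page_label.

    A missing or non-numeric label is treated as "stop paginating"
    (an assumed policy, chosen so the spider never raises here).
    """
    try:
        page = int(page_label)  # int() tolerates surrounding whitespace
    except (TypeError, ValueError):
        return False
    return page < max_pages


if __name__ == "__main__":
    print(should_follow("1", 2))   # True
    print(should_follow("2", 2))   # False
    print(should_follow(None, 2))  # False
```

Inside parse(), the condition would then read should_follow(current_text, self.max_pages) instead of comparing int(current_text) directly.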
