Question

To go to the author's page for each quote and parse the date of birth, you can modify your Scrapy spider as follows:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')

        for quote in quotes:
            # Build a fresh dict for every quote so items never share state
            item = {}
            item['name'] = quote.xpath('.//span[@class="text"]/text()').get()
            item['author'] = quote.xpath('.//small[@class="author"]/text()').get()
            item['tags'] = quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()

            # Extract the author's page URL and follow it.
            # dont_filter=True keeps Scrapy's duplicate filter from dropping
            # repeat visits to an author who has more than one quote.
            author_url = quote.xpath('.//small[@class="author"]/../a/@href').get()
            yield response.follow(
                author_url,
                self.parse_author_page,
                meta={'item': item},
                dont_filter=True,
            )

        # Follow the pagination link, if there is one
        new_page = response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            yield response.follow(new_page, self.parse)

    def parse_author_page(self, response):
        # Pick up the item passed along with the request and complete it
        item = response.meta['item']
        item['date_of_birth'] = response.xpath('//span[@class="author-born-date"]/text()').get()
        yield item

This code adds a new method parse_author_page that follows the author's page URL for each quote and extracts the date of birth. The meta attribute is used to pass the item between the main parse method and the parse_author_page method.

English:

I'm learning Scrapy. As an example, I'm using the website http://quotes.toscrape.com.
I've created a simple spider (scrapy genspider quotes).
I want to parse the quotes, and also go to each author's page and parse their date of birth.
I'm trying to do it this way, but nothing works.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')

        item = {}

        for quote in quotes:
            item['name'] = quote.xpath('.//span[@class="text"]/text()').get()
            item['author'] = quote.xpath('.//small[@class="author"]/text()').get()
            item['tags'] = quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            url = quote.xpath('.//small[@class="author"]/../a/@href').get()
            response.follow(url, self.parse_additional_page, item)

        new_page = response.xpath('//li[@class="next"]/a/@href').get()

        if new_page is not None:
            yield response.follow(new_page, self.parse)

    def parse_additional_page(self, response, item):
        item['additional_data'] = response.xpath('//span[@class="author-born-date"]/text()').get()
        yield item

Code without the date of birth (this works correctly):

import scrapy


class QuotesSpiderSpider(scrapy.Spider):
    name = "quotes_spider"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'name': quote.xpath('.//span[@class="text"]/text()').get(),
                'author': quote.xpath('.//small[@class="author"]/text()').get(),
                'tags': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            }

        new_page = response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            yield response.follow(new_page, self.parse)

Question: how to go to the author's page for each quote and parse the date of birth?

Answer 1

Score: 1


You are actually really close to having it right. You are missing a couple of things, and one thing needs to be moved.

1. `response.follow` returns a request object, so unless you yield that request it will never be dispatched by the Scrapy engine.
2. When passing objects from one callback function to another, you should use the `cb_kwargs` parameter. Using the `meta` dictionary works too, but Scrapy officially prefers `cb_kwargs`. Simply passing the object as a positional argument, however, will not work.
3. A dict is mutable, and that includes dicts used as Scrapy items. So when you create Scrapy items, each individual item should be its own object. Otherwise, when you update the item later, you might end up mutating previously yielded items.
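Points 1 and 3 are plain Python behaviour rather than anything specific to Scrapy, so they can be reproduced without it. The sketch below is only a toy model: `fake_follow`, `unyielded_parse`, `shared_item_parse` and `fixed_parse` are hypothetical names made up for illustration, and a plain dict stands in for a request object. Scrapy's real scheduling is more involved, but the two pitfalls show up the same way.

```python
def fake_follow(url, item):
    """Hypothetical stand-in for response.follow: it only builds a request."""
    return {"url": url, "item": item}


def unyielded_parse(authors):
    """Pitfall 1: requests are built but never yielded."""
    for author in authors:
        fake_follow(f"/author/{author}", {"author": author})  # discarded
    # Only this request ever reaches whoever consumes the generator
    yield fake_follow("/page/2/", None)


def shared_item_parse(authors):
    """Pitfall 3: one dict is reused for every request."""
    item = {}
    for author in authors:
        item["author"] = author  # overwrites the value seen by earlier requests
        yield fake_follow(f"/author/{author}", item)


def fixed_parse(authors):
    """A fresh dict per iteration, and every request is yielded."""
    for author in authors:
        item = {"author": author}
        yield fake_follow(f"/author/{author}", item)


authors = ["Albert Einstein", "Jane Austen"]
print(len(list(unyielded_parse(authors))))                             # 1
print([r["item"]["author"] for r in list(shared_item_parse(authors))])  # ['Jane Austen', 'Jane Austen']
print([r["item"]["author"] for r in list(fixed_parse(authors))])        # ['Albert Einstein', 'Jane Austen']
```

In the shared-dict case every request points at the same object, so by the time the callbacks run they all see the values written for the last quote.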

Here is an example that uses your code but implements the three points I made above.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            # moving the item constructor inside the loop
            # means it will be unique for each item
            item = {}

            item['name'] = quote.xpath('.//span[@class="text"]/text()').get()
            item['author'] = quote.xpath('.//small[@class="author"]/text()').get()
            item['tags'] = quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            url = quote.xpath('.//small[@class="author"]/../a/@href').get()
            # you have to yield the request returned by response.follow;
            # dont_filter=True stops the duplicate filter from dropping
            # repeat requests for an author with more than one quote
            yield response.follow(
                url,
                self.parse_additional_page,
                cb_kwargs={"item": item},
                dont_filter=True,
            )
        new_page = response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            # with no callback given, the request falls back to self.parse
            yield response.follow(new_page)

    def parse_additional_page(self, response, item=None):
        # item arrives as a keyword argument via cb_kwargs
        item['additional_data'] = response.xpath('//span[@class="author-born-date"]/text()').get()
        yield item

Partial output:

2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Martin-Luther-King-Jr/>
{'name': '“Only in the darkness can you see the stars.”', 'author': 'Martin Luther King Jr.', 'tags': ['hope', 'inspirational'], 'additional_data': 'January 15, 1929'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/C-S-Lewis/>
{'name': '“You can never get a cup of tea large enough or a book long enough to suit me.”', 'author': 'C.S. Lewis', 'tags': ['books', 'inspirational', 'reading', 'tea'], 'additional_data': 'November 29, 1898'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/George-R-R-Martin/>
{'name': '“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”', 'author': 'George R.R. Martin', 'tags': ['read', 'readers', 'reading', 'reading-books'], 'additional_data': 'September 20, 1948'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/James-Baldwin/>
{'name': '“Love does not begin and end the way we seem to think it does. Love is a battle, love is a war; love is a growing up.”', 'author': 'James Baldwin', 'tags': ['love'], 'additional_data': 'August 02, 1924'}

See "Passing additional data to callback functions" and `Response.follow` in the Scrapy docs for more information.
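The `additional_data` values in the output above are still plain strings such as 'January 15, 1929'. If a real date object is needed, a minimal sketch along these lines could convert them. `parse_born_date` is a hypothetical helper, not part of Scrapy, and it assumes the site keeps the "Month DD, YYYY" format and that month names are parsed in an English (default C) locale.

```python
from datetime import date, datetime


def parse_born_date(text: str) -> date:
    """Convert a string like 'January 15, 1929' into a datetime.date.

    Assumes the 'Month DD, YYYY' layout seen in the scraped output;
    raises ValueError if the text does not match that layout.
    """
    return datetime.strptime(text.strip(), "%B %d, %Y").date()


print(parse_born_date("January 15, 1929"))  # 1929-01-15
print(parse_born_date("August 02, 1924"))   # 1924-08-02
```

It could be called inside the author-page callback before yielding the item, with a `try`/`except ValueError` around it if some pages might use a different format.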
