Scrapy and Python parsing

Question

To go to the author's page for each quote and parse the date of birth, you can modify your Scrapy spider as follows:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')

        for quote in quotes:
            item = {}
            item['name'] = quote.xpath('.//span[@class="text"]/text()').get()
            item['author'] = quote.xpath('.//small[@class="author"]/text()').get()
            item['tags'] = quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            
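            # Extract the author's page URL and follow it.
            # dont_filter=True stops the duplicate filter from dropping
            # repeat requests for an author who has several quotes.
            author_url = quote.xpath('.//small[@class="author"]/../a/@href').get()
            yield response.follow(author_url, self.parse_author_page,
                                  meta={'item': item}, dont_filter=True)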

        new_page = response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            yield response.follow(new_page, self.parse)

    def parse_author_page(self, response):
        item = response.meta['item']
        item['date_of_birth'] = response.xpath('//span[@class="author-born-date"]/text()').get()
        yield item

This code adds a new callback, parse_author_page, which handles the author page requested for each quote and extracts the date of birth. The request's meta dictionary carries the item from the main parse method to parse_author_page.
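One caveat worth checking: by default Scrapy filters out requests for URLs it has already scheduled, and several quotes on the site share an author, so repeated author-page requests may be dropped together with the item attached to them. Passing dont_filter=True to response.follow is one way around that. The sketch below is a toy model of that filtering in plain Python, not Scrapy's actual implementation; the schedule function, the (url, dont_filter) pairs, and the sample URLs are made up for illustration.

```python
from typing import Iterable, List, Set, Tuple


def schedule(requests: Iterable[Tuple[str, bool]]) -> List[str]:
    """Toy scheduler: skip URLs already seen unless the request opts out.

    Each request is a (url, dont_filter) pair; this is an illustrative
    stand-in, not a Scrapy API.
    """
    seen: Set[str] = set()
    accepted: List[str] = []
    for url, dont_filter in requests:
        if not dont_filter and url in seen:
            continue  # duplicate dropped; any item riding on it is lost
        seen.add(url)
        accepted.append(url)
    return accepted


# Two quotes by the same author produce the same author-page URL.
author_urls = [
    "/author/Albert-Einstein",
    "/author/J-K-Rowling",
    "/author/Albert-Einstein",
]

filtered = schedule((url, False) for url in author_urls)
unfiltered = schedule((url, True) for url in author_urls)
print(len(filtered), len(unfiltered))
```

In this model the filtered run accepts fewer requests than there are quotes, which is how a spider can end up yielding fewer items than expected.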

I'm learning Scrapy, using the website http://quotes.toscrape.com as an example.
I'm creating a simple spider (scrapy genspider quotes).
I want to parse the quotes, and also go to each author's page and parse the author's date of birth.
I'm trying to do it this way, but it does not work.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        
        quotes=response.xpath('//div[@class="quote"]') 
        
        item={}

        for quote in quotes: 
            item['name']=quote.xpath('.//span[@class="text"]/text()').get()
            item['author']=quote.xpath('.//small[@class="author"]/text()').get()
            item['tags']=quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            url=quote.xpath('.//small[@class="author"]/../a/@href').get()
            response.follow(url, self.parse_additional_page, item) 
            

        new_page=response.xpath('//li[@class="next"]/a/@href').get() 

        if new_page is not None: 

            yield response.follow(new_page,self.parse) 
            
    def parse_additional_page(self, response, item): 
        item['additional_data'] = response.xpath('//span[@class="author-born-date"]/text()').get() 
        yield item
            

Code without the date of birth (this works correctly):

import scrapy


class QuotesSpiderSpider(scrapy.Spider):
    name = "quotes_spider"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'name': quote.xpath('.//span[@class="text"]/text()').get(),
                'author': quote.xpath('.//small[@class="author"]/text()').get(),
                'tags': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall(),
            }
        new_page = response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            yield response.follow(new_page, self.parse)

Question: How do I go to the author's page for each quote and parse the date of birth?

Answer 1

Score: 1

You are actually really close to having it right. You are missing just a couple of things, and one thing needs to be moved.

  1. response.follow returns a request object, so unless you yield that request object it will never be dispatched by the Scrapy engine.

  2. When passing objects from one callback function to another, you should use the cb_kwargs parameter. Using the meta dictionary works too, but Scrapy officially prefers cb_kwargs. However, simply passing the object as a positional argument will not work.

  3. A dict is mutable, and that includes dicts used as Scrapy items. So when you create Scrapy items, each individual item should be a distinct object. Otherwise, when you update the item later, you might end up mutating previously yielded items.
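Point 3 is easy to reproduce outside Scrapy. The following sketch (plain Python; the function names and sample rows are invented for illustration) contrasts reusing one dict across loop iterations with creating a new dict each time:

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str]


def build_items_shared(rows: List[Row]) -> List[Dict[str, str]]:
    """Buggy pattern: one dict is created outside the loop and reused."""
    item: Dict[str, str] = {}
    collected = []
    for name, author in rows:
        item["name"] = name
        item["author"] = author
        collected.append(item)  # appends the same object every time
    return collected


def build_items_fresh(rows: List[Row]) -> List[Dict[str, str]]:
    """Fixed pattern: a new dict is created on every iteration."""
    collected = []
    for name, author in rows:
        item = {"name": name, "author": author}
        collected.append(item)
    return collected


rows = [("quote A", "author 1"), ("quote B", "author 2")]
shared = build_items_shared(rows)
fresh = build_items_fresh(rows)
print(shared)
print(fresh)
```

With the shared dict, every entry in the list is the same object and ends up holding the last row's values; with a fresh dict per iteration, each entry keeps its own data.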

Here is an example that uses your code but implements the three points I made above.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            # moving the item constructor inside the loop
            # means it will be unique for each item
            item = {}

            item['name'] = quote.xpath('.//span[@class="text"]/text()').get()
            item['author'] = quote.xpath('.//small[@class="author"]/text()').get()
            item['tags'] = quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            url = quote.xpath('.//small[@class="author"]/../a/@href').get()
            # you have to yield the request returned by response.follow;
            # dont_filter=True keeps the duplicate filter from dropping
            # repeat requests for an author who has several quotes
            yield response.follow(url, self.parse_additional_page,
                                  cb_kwargs={"item": item}, dont_filter=True)
        new_page = response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            # with no callback given, the response is handled by self.parse
            yield response.follow(new_page)

    def parse_additional_page(self, response, item=None):
        item['additional_data'] = response.xpath('//span[@class="author-born-date"]/text()').get()
        yield item

Partial Output:

2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Martin-Luther-King-Jr/>
{'name': '“Only in the darkness can you see the stars.”', 'author': 'Martin Luther King Jr.', 'tags': ['hope', 'inspirational'], 'additional_data': 'January 15, 1929'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/C-S-Lewis/>
{'name': '“You can never get a cup of tea large enough or a book long enough to suit me.”', 'author': 'C.S. Lewis', 'tags': ['books', 'inspirational', 'reading', 'tea'], 'additional_data': 'November 29, 1898'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/George-R-R-Martin/>
{'name': '“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”', 'author': 'George R.R. Martin', 'tags': ['read', 'readers', 'reading', 'reading-books'], 'additional_data': 'September 20, 1948'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/James-Baldwin/>
{'name': '“Love does not begin and end the way we seem to think it does. Love is a battle, love is a war; love is a growing up.”', 'author': 'James Baldwin', 'tags': ['love'], 'additional_data': 'August 02, 1924'}

For more information, see "Passing additional data to callback functions" and "Response.follow" in the Scrapy docs.
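Points 1 and 2 can also be illustrated with a toy model. The sketch below is not Scrapy code: ToyRequest and run are hypothetical stand-ins that only imitate how an engine consumes what a callback yields and hands cb_kwargs to the next callback as keyword arguments.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterator, List


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a request: URL, callback, keyword data."""
    url: str
    callback: Callable[..., Iterator[Any]]
    cb_kwargs: Dict[str, Any] = field(default_factory=dict)


def run(start: Iterator[Any]) -> List[dict]:
    """Toy engine: follow yielded requests and collect yielded items."""
    items: List[dict] = []
    pending = list(start)
    while pending:
        result = pending.pop(0)
        if isinstance(result, ToyRequest):
            fake_response = f"<page {result.url}>"
            # Extra data reaches the callback as keyword arguments.
            pending.extend(result.callback(fake_response, **result.cb_kwargs))
        else:
            items.append(result)
    return items


def parse_author(response: str, item: dict) -> Iterator[dict]:
    item["born"] = f"date parsed from {response}"
    yield item


def parse_without_yield() -> Iterator[Any]:
    # Built but never yielded: the engine has no way to see this request.
    ToyRequest("/author/a", parse_author, {"item": {"name": "q1"}})
    yield from ()  # nothing is handed to the engine


def parse_with_yield() -> Iterator[Any]:
    yield ToyRequest("/author/a", parse_author, {"item": {"name": "q1"}})


print(run(parse_without_yield()))
print(run(parse_with_yield()))
```

The first call produces no items because the request object is created and then discarded; the second produces one item, with the dict passed through cb_kwargs arriving in parse_author as the item argument.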

huangapple
  • Published on May 11, 2023 at 04:20:37
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76222278.html