Scrapy and Python parsing
Question
To go to the author's page for each quote and parse the date of birth, you can modify your Scrapy spider as follows:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            item = {}
            item['name'] = quote.xpath('.//span[@class="text"]/text()').get()
            item['author'] = quote.xpath('.//small[@class="author"]/text()').get()
            item['tags'] = quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            # Extract the author's page URL and follow it
            author_url = quote.xpath('.//small[@class="author"]/../a/@href').get()
            yield response.follow(author_url, self.parse_author_page, meta={'item': item})

        new_page = response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            yield response.follow(new_page, self.parse)

    def parse_author_page(self, response):
        item = response.meta['item']
        item['date_of_birth'] = response.xpath('//span[@class="author-born-date"]/text()').get()
        yield item
This code adds a new callback, parse_author_page. The parse method follows the author's page URL for each quote, and parse_author_page extracts the date of birth from that page. The meta argument of response.follow carries the item from parse to parse_author_page, where it is read back from response.meta.
Original question:
I'm learning Scrapy. For example, there is a website http://quotes.toscrape.com .
I'm creating a simple spider (scrapy genspider quotes).
I want to parse quotes, as well as go to the author's page and parse his date of birth.
I'm trying to do it this way, but nothing works.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        quotes=response.xpath('//div[@class="quote"]')
        item={}
        for quote in quotes:
            item['name']=quote.xpath('.//span[@class="text"]/text()').get()
            item['author']=quote.xpath('.//small[@class="author"]/text()').get()
            item['tags']=quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            url=quote.xpath('.//small[@class="author"]/../a/@href').get()
            response.follow(url, self.parse_additional_page, item)

        new_page=response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            yield response.follow(new_page,self.parse)

    def parse_additional_page(self, response, item):
        item['additional_data'] = response.xpath('//span[@class="author-born-date"]/text()').get()
        yield item
Code without date of birth (is correct):
import scrapy


class QuotesSpiderSpider(scrapy.Spider):
    name = "quotes_spider"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        quotes=response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'name':quote.xpath('.//span[@class="text"]/text()').get(),
                'author':quote.xpath('.//small[@class="author"]/text()').get(),
                'tags':quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            }

        new_page=response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            yield response.follow(new_page,self.parse)
Question: how do I go to the author's page for each quote and parse the date of birth?
Answer 1
Score: 1
You are actually really close to having it right. There are just a couple of things you are missing and one thing that needs to be moved.
1. `response.follow` returns a request object, so unless you `yield` that request object it will never be dispatched by the Scrapy engine.
2. When passing objects from one callback function to another, you should use the `cb_kwargs` parameter. Using the `meta` dictionary works too, but Scrapy officially prefers `cb_kwargs`. However, simply passing the object as a positional argument will not work.
3. A `dict` is mutable, and that includes dicts used as Scrapy items. So when you create Scrapy items, each individual item should be its own object. Otherwise, when you update that item later, you might end up mutating previously yielded items.
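To make the third point concrete, here is a minimal sketch of the shared-dict pitfall that does not need Scrapy at all. The function names `build_shared` and `build_fresh` and the sample rows are made up for illustration; they are not part of the spider.

```python
def build_shared(rows):
    """Reuse one dict for every row, as the question's spider does."""
    item = {}  # created once, outside the loop
    collected = []
    for name, author in rows:
        item["name"] = name
        item["author"] = author
        collected.append(item)  # appends the same dict object each time
    return collected


def build_fresh(rows):
    """Create a new dict per row, as the corrected spider does."""
    collected = []
    for name, author in rows:
        item = {"name": name, "author": author}  # new object on every pass
        collected.append(item)
    return collected


# Hypothetical sample data standing in for scraped quotes.
rows = [("quote A", "author 1"), ("quote B", "author 2")]
shared = build_shared(rows)
fresh = build_fresh(rows)

print(shared)  # both entries show the values of the last row
print(fresh)   # each entry keeps its own values
```

In the shared version every entry in the list is the same object, so the last update overwrites what the earlier entries appeared to hold. The same thing would happen to a single `item` dict handed to several pending callbacks.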
Here is an example that uses your code but implements the three points I made above.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            # moving the item constructor inside the loop
            # means it will be unique for each item
            item={}
            item['name']=quote.xpath('.//span[@class="text"]/text()').get()
            item['author']=quote.xpath('.//small[@class="author"]/text()').get()
            item['tags']=quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            url=quote.xpath('.//small[@class="author"]/../a/@href').get()
            # you have to yield the request returned by response.follow
            yield response.follow(url, self.parse_additional_page, cb_kwargs={"item": item})

        new_page=response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            # with no callback given, the response is handled by self.parse
            yield response.follow(new_page)

    def parse_additional_page(self, response, item=None):
        # item arrives as a keyword argument via cb_kwargs
        item['additional_data'] = response.xpath('//span[@class="author-born-date"]/text()').get()
        yield item
Partial Output:
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Martin-Luther-King-Jr/>
{'name': '“Only in the darkness can you see the stars.”', 'author': 'Martin Luther King Jr.', 'tags': ['hope', 'inspirational'], 'additional_data': 'January 15, 1929'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/C-S-Lewis/>
{'name': '“You can never get a cup of tea large enough or a book long enough to suit me.”', 'author': 'C.S. Lewis', 'tags': ['books', 'inspirational', 'reading', 'tea'], 'additional_data': 'November 29, 1898'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/George-R-R-Martin/>
{'name': '“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”', 'author': 'George R.R. Martin', 'tags': ['read', 'readers', 'reading', 'reading-books'], 'additional_data': 'September 20, 1948'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/James-Baldwin/>
{'name': '“Love does not begin and end the way we seem to think it does. Love is a battle, love is a war; love is a growing up.”', 'author': 'James Baldwin', 'tags': ['love'], 'additional_data': 'August 02, 1924'}
Check out "Passing additional data to callback functions" and `Response.follow` in the Scrapy docs for more information.