Scrapy output is missing a row from the page

Question
The page has 10 quotes. When I collect them into a list, all 10 show up.
But when I run the code to scrape the page, one quote is missing from the output, so there are only 9 rows of data.
(Note) I noticed the missing quote is by the same (author) as another quote on the page; not sure if that has anything to do with it.
Page being scraped: https://quotes.toscrape.com/page/4
The same happens with other pages.
I have two functions: one scrapes the URLs and some basic info about each quote, then follows those URLs to scrape data about the author and build a dict there.
Here is the code:
def parse(self, response):
    qs = response.css('.quote')
    for q in qs:
        n = {}
        page_url = q.css('span a').attrib['href']
        full_page_url = 'https://quotes.toscrape.com' + page_url
        # tags
        t = []
        tags = q.css('.tag')
        for tag in tags:
            t.append(tag.css('::text').get())
        # items
        n['quote'] = q.css('.text ::text').get(),
        n['tag'] = t,
        n['author'] = q.css('span .author ::text').get()
        yield response.follow(full_page_url, callback=self.parse_page, meta={'item': n})

def parse_page(self, response):
    q = response.css('.author-details')
    item = response.meta.get('item')
    yield {
        'text': item['quote'],
        'author': item['author'],
        'tags': item['tag'],
        'date': q.css('p .author-born-date ::text').get(),
        'location': q.css('p .author-born-location ::text').get(),
    }
def parse_page(self, response):
q = response.css('.author-details')
item = response.meta.get('item')
yield {
'text': item['quote'],
'author': item['author'],
'tags': item['tag'],
'date': q.css('p .author-born-date ::text').get(),
'location': q.css('p .author-born-location ::text').get(),
}
I also tried using Items (Scrapy Fields), with the same result. I also tried debugging and printing the data from the first function: the missing row shows up there, but it never gets sent to the second function. So I tried different ways of passing the dict with the first function's info to the second one. I tried cb_kwargs:

yield response.follow(full_page_url, callback=self.parse_page, cb_kwargs={'item': n})
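For the cb_kwargs approach to work, the callback must accept the item as a keyword argument instead of reading response.meta, since Scrapy invokes the callback roughly as callback(response, **cb_kwargs). A minimal sketch of that calling convention, with a plain string and an illustrative quote dict standing in for the real response and scraped data:

```python
def parse_page(response, item):
    # With cb_kwargs={'item': n}, Scrapy passes the dict straight into
    # this parameter; no response.meta lookup is needed.
    return {'text': item['quote'], 'author': item['author']}

# Simulate how Scrapy would invoke the callback for one request.
cb_kwargs = {'item': {'quote': 'A day without sunshine...', 'author': 'Steve Martin'}}
result = parse_page('<fake response>', **cb_kwargs)
print(result['author'])  # Steve Martin
```

Note that switching from meta to cb_kwargs only changes how the data travels to the callback; it does not affect whether the request itself is scheduled.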
Answer 1

Score: 0

Scrapy has a built-in duplicate filter that automatically ignores duplicate URLs. When a page has two quotes by the same author, both quotes link to the same author-details URL, so when the second occurrence of that URL is reached the request is dropped and that item is never yielded to the output feed processors.

You can fix this by setting the dont_filter parameter to True in your requests.

For example:
def parse(self, response):
    for q in response.css('.quote'):
        n = {}
        n["tags"] = q.css('.tag::text').getall()
        n['quote'] = q.css('.text ::text').get().strip()
        n['author'] = q.css('span .author ::text').get().strip()
        page_url = q.css('span a').attrib['href']
        yield response.follow(page_url, callback=self.parse_page, meta={'item': n}, dont_filter=True)

def parse_page(self, response):
    q = response.css('.author-details')
    item = response.meta.get('item')
    item["date"] = q.css('p .author-born-date ::text').get()
    item["location"] = q.css('p .author-born-location ::text').get()
    yield item
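The filtering described above can be illustrated without running a crawl. Below is a minimal sketch using a plain set of seen URLs as a simplified stand-in for Scrapy's duplicate filter (the real RFPDupeFilter fingerprints the whole request, not just the URL, but the effect on this spider is the same):

```python
def schedule(urls, dont_filter=False):
    """Return the URLs that would actually be requested.

    Simplified model of Scrapy's duplicate filter: repeats are
    dropped unless the request was created with dont_filter=True.
    """
    seen = set()
    scheduled = []
    for url in urls:
        if dont_filter or url not in seen:
            seen.add(url)
            scheduled.append(url)
    return scheduled

# Illustrative author-page URLs: two quotes sharing one author.
author_urls = [
    'https://quotes.toscrape.com/author/Albert-Einstein/',
    'https://quotes.toscrape.com/author/Marilyn-Monroe/',
    'https://quotes.toscrape.com/author/Albert-Einstein/',
]

print(len(schedule(author_urls)))                    # 2 -> one item is lost
print(len(schedule(author_urls, dont_filter=True)))  # 3 -> all items survive
```

With the default filtering, the second request to the shared author page is dropped before the callback ever runs, which is why one row disappears from the output.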