Scrapy output missing a row from the page

Question

The page has 10 quotes; when I put them into a list, all 10 show up.

But when I run the code to scrape it, one quote is missing from the output, so there are only 9 rows of data.

(Note) I noticed that the missing quote is one of two quotes by the same author; I'm not sure if that has anything to do with it.

Page being scraped: https://quotes.toscrape.com/page/4
The same thing happens on other pages.

I have two functions: the first scrapes the URLs and some basic info about each quote, then follows those URLs so the second can scrape data about the author and build a dict there.

Here is the code:

def parse(self, response):
    qs = response.css('.quote')
    for q in qs:
        n = {}
        page_url = q.css('span a').attrib['href']
        full_page_url = 'https://quotes.toscrape.com' + page_url

        # tags
        t = []
        tags = q.css('.tag')
        for tag in tags:
            t.append(tag.css('::text').get())

        # items
        n['quote'] = q.css('.text ::text').get(),
        n['tag'] = t,
        n['author'] = q.css('span .author ::text').get()
        yield response.follow(full_page_url, callback=self.parse_page, meta={'item': n})

def parse_page(self, response):
    q = response.css('.author-details')
    item = response.meta.get('item')
    yield {
        'text': item['quote'],
        'author': item['author'],
        'tags': item['tag'],
        'date': q.css('p .author-born-date ::text').get(),
        'location': q.css('p .author-born-location ::text').get(),
    }

I also tried using Items (Scrapy Fields) with the same result. I also tried debugging by printing the data from the first function; the missing row shows up there, but it never gets sent to the second function. So I tried different ways of passing the dict with the first function's info to the second one; I tried cb_kwargs:

yield response.follow(full_page_url, callback=self.parse_page, cb_kwargs={'item': n})
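
With cb_kwargs the callback receives the dict as a keyword argument instead of reading it from response.meta, so the second function has to be adjusted roughly like this sketch (the row was still missing either way):

def parse_page(self, response, item):  # 'item' is injected from cb_kwargs
    q = response.css('.author-details')
    yield {
        'text': item['quote'],
        'author': item['author'],
        'tags': item['tag'],
        'date': q.css('p .author-born-date ::text').get(),
        'location': q.css('p .author-born-location ::text').get(),
    }
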
Answer 1

Score: 0

Scrapy has a built-in duplicate filter that automatically ignores duplicate URLs. When you have two quotes by the same author, both of them point to the same author-details URL, so when the second occurrence of that URL is reached, the request is ignored and that item is never yielded to the output feed processors.
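
To confirm that the duplicate filter is what is dropping the request, you can turn on Scrapy's dupe-filter debug logging; by default only the first filtered duplicate is logged:

# settings.py (or the spider's custom_settings dict)
DUPEFILTER_DEBUG = True  # log every filtered duplicate request instead of only the first

With this enabled, the crawl log should show a 'Filtered duplicate request' line for the repeated author URL.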

You can fix this by setting the dont_filter parameter to True on the request.

For example:

def parse(self, response):
    for q in response.css('.quote'):
        n = {}
        n["tags"] = q.css('.tag::text').getall()
        n['quote'] = q.css('.text ::text').get().strip()
        n['author'] = q.css('span .author ::text').get().strip()
        page_url = q.css('span a').attrib['href']
        # dont_filter=True stops the duplicate filter from dropping repeated author URLs
        yield response.follow(page_url, callback=self.parse_page, meta={'item': n}, dont_filter=True)

def parse_page(self, response):
    q = response.css('.author-details')
    item = response.meta.get('item')
    item["date"] = q.css('p .author-born-date ::text').get()
    item["location"] = q.css('p .author-born-location ::text').get()
    yield item
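
Note that with dont_filter=True the same author page is downloaded once per quote, so a page with several quotes by one author triggers several identical requests. For a small site like quotes.toscrape.com this is harmless; on a larger crawl you may prefer to fetch each author page only once and join the author data afterwards.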
