Scrapy output is missing a row from the page

Question
The page has 10 quotes. When I collect them into a list, all 10 show up.
But when I run the code to scrape the page, one quote is missing from the output, so there are only 9 rows of data.
(Note) I noticed the missing quote is by the same (author) as another quote on the page; not sure if that has anything to do with it.
Page being scraped: https://quotes.toscrape.com/page/4
The same happens with other pages.
I have two functions: one scrapes the URLs and some basic info about each quote, then follows those URLs to scrape data about the author and build a dict there.
Here is the code:
def parse(self, response):
    qs = response.css('.quote')
    for q in qs:
        n = {}
        page_url = q.css('span a').attrib['href']
        full_page_url = 'https://quotes.toscrape.com' + page_url
        # tags
        t = []
        tags = q.css('.tag')
        for tag in tags:
            t.append(tag.css('::text').get())
        # items
        n['quote'] = q.css('.text ::text').get(),
        n['tag'] = t,
        n['author'] = q.css('span .author ::text').get()
        yield response.follow(full_page_url, callback=self.parse_page, meta={'item': n})

def parse_page(self, response):
    q = response.css('.author-details')
    item = response.meta.get('item')
    yield {
        'text': item['quote'],
        'author': item['author'],
        'tags': item['tag'],
        'date': q.css('p .author-born-date ::text').get(),
        'location': q.css('p .author-born-location ::text').get(),
    }
def parse_page(self, response):
q = response.css('.author-details')
item = response.meta.get('item')
yield {
'text': item['quote'],
'author': item['author'],
'tags': item['tag'],
'date': q.css('p .author-born-date ::text').get(),
'location': q.css('p .author-born-location ::text').get(),
}
I also tried using Items (Scrapy Fields), with the same result. I also tried debugging and printing the data from the first function: the missing row shows up there, but it never gets sent to the second function. So I tried different ways of passing the dict with the first function's info to the second one. I tried cb_kwargs:

yield response.follow(full_page_url, callback=self.parse_page, cb_kwargs={'item': n})
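For the cb_kwargs approach to work, the callback must accept the item as a keyword argument instead of reading response.meta, since Scrapy invokes the callback roughly as callback(response, **cb_kwargs). A minimal sketch of that calling convention, with a plain string and an illustrative quote dict standing in for the real response and scraped data:

```python
def parse_page(response, item):
    # With cb_kwargs={'item': n}, Scrapy passes the dict straight into
    # this parameter; no response.meta lookup is needed.
    return {'text': item['quote'], 'author': item['author']}

# Simulate how Scrapy would invoke the callback for one request.
cb_kwargs = {'item': {'quote': 'A day without sunshine...', 'author': 'Steve Martin'}}
result = parse_page('<fake response>', **cb_kwargs)
print(result['author'])  # Steve Martin
```

Note that switching from meta to cb_kwargs only changes how the data travels to the callback; it does not affect whether the request itself is scheduled.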
Answer 1

Score: 0

Scrapy has a built-in duplicate filter that automatically ignores duplicate URLs. When a page has two quotes by the same author, both quotes link to the same author-details URL, so when the second occurrence of that URL is reached the request is dropped and that item is never yielded to the output feed processors.

You can fix this by setting the dont_filter parameter to True in your requests.

For example:
def parse(self, response):
    for q in response.css('.quote'):
        n = {}
        n["tags"] = q.css('.tag::text').getall()
        n['quote'] = q.css('.text ::text').get().strip()
        n['author'] = q.css('span .author ::text').get().strip()
        page_url = q.css('span a').attrib['href']
        yield response.follow(page_url, callback=self.parse_page, meta={'item': n}, dont_filter=True)

def parse_page(self, response):
    q = response.css('.author-details')
    item = response.meta.get('item')
    item["date"] = q.css('p .author-born-date ::text').get()
    item["location"] = q.css('p .author-born-location ::text').get()
    yield item
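The filtering described above can be illustrated without running a crawl. Below is a minimal sketch using a plain set of seen URLs as a simplified stand-in for Scrapy's duplicate filter (the real RFPDupeFilter fingerprints the whole request, not just the URL, but the effect on this spider is the same):

```python
def schedule(urls, dont_filter=False):
    """Return the URLs that would actually be requested.

    Simplified model of Scrapy's duplicate filter: repeats are
    dropped unless the request was created with dont_filter=True.
    """
    seen = set()
    scheduled = []
    for url in urls:
        if dont_filter or url not in seen:
            seen.add(url)
            scheduled.append(url)
    return scheduled

# Illustrative author-page URLs: two quotes sharing one author.
author_urls = [
    'https://quotes.toscrape.com/author/Albert-Einstein/',
    'https://quotes.toscrape.com/author/Marilyn-Monroe/',
    'https://quotes.toscrape.com/author/Albert-Einstein/',
]

print(len(schedule(author_urls)))                    # 2 -> one item is lost
print(len(schedule(author_urls, dont_filter=True)))  # 3 -> all items survive
```

With the default filtering, the second request to the shared author page is dropped before the callback ever runs, which is why one row disappears from the output.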