Scrapy callback not executed when using Playwright for JavaScript rendering

Question

I'm using Scrapy with the Playwright plugin to crawl a website that relies on JavaScript for rendering.
My spider includes two asynchronous functions, parse_categories and parse_product_page.

The parse_categories function checks the page for categories and, as long as it finds any, yields requests back to the parse_categories callback.
A page with no categories should be a product page; in that case it should send a request to the parse_product_page callback instead.

However, when it reaches the else block in parse_categories, it seems that the request to parse_product_page is never made. I've confirmed that the code enters the else block, but the print statement in the parse_product_page function is never reached.

Here is my reprex (parse_urls and parse stand in for parse_categories and parse_product_page):

import scrapy
from scrapy_playwright.page import PageMethod


class Spider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ['quotes.toscrape.com']

    def start_requests(self):
        yield scrapy.Request(
            url='https://quotes.toscrape.com/js/',
            callback=self.parse_urls,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod('wait_for_selector', 'body > div > nav > ul > li > a'),
                ],
            ),
        )

    async def parse_urls(self, response):
        # Close the Playwright page once the rendered response is available
        page = response.meta['playwright_page']
        await page.close()

        next_page_url = response.xpath('//li[@class="next"]/a/@href').get()

        if next_page_url:
            print("Inside if block")
            url = 'https://quotes.toscrape.com' + next_page_url
            yield scrapy.Request(
                url=url,
                callback=self.parse_urls,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_methods=[
                        PageMethod('wait_for_selector', 'body > div > div.quote'),
                    ],
                ),
            )
        else:
            print("Next page link not found")
            # Request the current URL again, this time with a different callback
            yield scrapy.Request(
                url=response.request.url,
                callback=self.parse,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_methods=[
                        PageMethod('wait_for_selector', 'body > div > div.quote'),
                    ],
                ),
            )

    async def parse(self, response):
        page = response.meta['playwright_page']
        await page.close()
        print("Function has been called, because next page link not found")

These are the logs from the reprex:

Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Next page link not found
2023-04-11 09:47:04 [root] WARNING: spider quotes finished crawling

Answer 1

Score: 0

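This was fixed by adding dont_filter=True to the scrapy.Request yielded in the else block. The else branch requests response.request.url, a URL the spider has already crawled, so Scrapy's duplicate filter drops the request before parse is ever called. Setting dont_filter=True exempts that one request from the filter.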

else:
    yield scrapy.Request(
        url=response.request.url,
        callback=self.parse,
        dont_filter=True,
        meta=dict(
            playwright=True,
            playwright_include_page=True,
            playwright_page_methods=[
                PageMethod('wait_for_selector', 'body > div > div.quote'),
            ],
        ),
    )
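
To see why the callback makes no difference here, the following is a minimal sketch of URL-based duplicate filtering. It is a simplified model, not Scrapy's actual implementation: FakeRequest, FakeScheduler, and the example URL are hypothetical names made up for illustration. It only mimics the relevant behaviour, namely that a request is identified by what it fetches (the URL in this model) and not by its callback, and that dont_filter skips the check.

```python
# Simplified, hypothetical model of duplicate filtering.
# NOT Scrapy's code: FakeRequest and FakeScheduler are illustrative names.
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeRequest:
    url: str
    callback: str              # callback name, kept only for illustration
    dont_filter: bool = False


class FakeScheduler:
    def __init__(self):
        self.seen = set()      # URLs that have passed through the filter
        self.queue = []        # requests that would actually be downloaded

    def enqueue(self, request):
        """Return True if the request is scheduled, False if it is dropped."""
        if not request.dont_filter:
            if request.url in self.seen:
                return False   # dropped as a duplicate; callback never runs
            self.seen.add(request.url)
        self.queue.append(request)
        return True


scheduler = FakeScheduler()
# Illustrative URL standing in for the last page the spider reached
last_page = "https://quotes.toscrape.com/js/page/10/"

# First visit, as yielded from the if block
first = scheduler.enqueue(FakeRequest(last_page, "parse_urls"))
# Same URL with another callback: the callback is not part of the key
dropped = scheduler.enqueue(FakeRequest(last_page, "parse"))
# Same URL again, but exempt from the filter
forced = scheduler.enqueue(FakeRequest(last_page, "parse", dont_filter=True))

print(first, dropped, forced)
```

In this model the second request is rejected even though it names a different callback, which matches what the reprex logs suggest: the else block runs, but parse is never reached. Since dont_filter=True disables the check for that request entirely, it is probably best kept on the single request that needs it.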

huangapple
  • Posted on April 11, 2023 at 07:13:54
  • Please keep this link when reposting: https://go.coder-hub.com/75981380.html