Question

I'm using Scrapy with the Playwright plugin (scrapy-playwright) to crawl a website that relies on JavaScript for rendering.
My spider includes two asynchronous callbacks, parse_categories and parse_product_page.

The parse_categories callback checks for categories at the given URL and, while it finds any, sends further requests back to parse_categories. This repeats until a page with no categories is reached, which should be a product page.
At that point, it should send a request to the parse_product_page callback.

However, when execution reaches the else block in parse_categories, the request to parse_product_page never seems to be made. I've confirmed that the code enters the else block, but the print statement in parse_product_page is never reached.

Here is my reprex, in which parse_urls stands in for parse_categories and parse stands in for parse_product_page:

import scrapy
from scrapy_playwright.page import PageMethod


class Spider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ['quotes.toscrape.com']

    def start_requests(self):
        yield scrapy.Request(
            url='https://quotes.toscrape.com/js/',
            callback=self.parse_urls,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod('wait_for_selector', 'body > div > nav > ul > li > a'),
                ],
            ),
        )

    async def parse_urls(self, response):
        # Close the Playwright page as soon as the rendered response is available.
        page = response.meta['playwright_page']
        await page.close()

        next_page_url = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page_url:
            print("Inside if block")
            url = 'https://quotes.toscrape.com' + next_page_url
            yield scrapy.Request(
                url=url,
                callback=self.parse_urls,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_methods=[
                        PageMethod('wait_for_selector', 'body > div > div.quote'),
                    ],
                ),
            )
        else:
            print("Next page link not found")
            # Request the current URL again, this time with a different callback.
            yield scrapy.Request(
                url=response.request.url,
                callback=self.parse,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_methods=[
                        PageMethod('wait_for_selector', 'body > div > div.quote'),
                    ],
                ),
            )

    async def parse(self, response):
        page = response.meta['playwright_page']
        await page.close()
        print("Function has been called, because next page link not found")

These are the logs from the reprex:

Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Next page link not found
2023-04-11 09:47:04 [root] WARNING: spider quotes finished crawling

Answer 1

Score: 0

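This issue was fixed by adding dont_filter=True to the scrapy.Request yielded in the else block. That request targets response.request.url, a URL the spider has already crawled, so without the flag Scrapy's duplicate filter drops it before it is downloaded and the parse callback never runs.

else:
    yield scrapy.Request(
        url=response.request.url,
        callback=self.parse,
        dont_filter=True,  # bypass the duplicate filter for this already-seen URL
        meta=dict(
            playwright=True,
            playwright_include_page=True,
            playwright_page_methods=[
                PageMethod('wait_for_selector', 'body > div > div.quote'),
            ],
        ),
    )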
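To see why the flag matters, here is a toy model of a scheduler with a duplicate filter. It is a simplified sketch, not Scrapy's actual implementation: the Request and ToyScheduler classes are hypothetical stand-ins, the fingerprint is reduced to the bare URL, and the last-page URL is only illustrative. The point it shows is that the callback plays no part in deciding whether a request is a duplicate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical stand-in for scrapy.Request (only the fields needed here)."""
    url: str
    callback: str
    dont_filter: bool = False


class ToyScheduler:
    """Simplified model of a scheduler with a URL-based duplicate filter."""

    def __init__(self):
        self.seen = set()   # fingerprints of requests already accepted
        self.queue = []     # requests that would actually be downloaded

    def enqueue(self, request):
        # The callback is not part of the fingerprint, so the same URL
        # with a different callback still counts as a duplicate.
        fingerprint = request.url
        if not request.dont_filter and fingerprint in self.seen:
            return False    # dropped: its callback will never run
        self.seen.add(fingerprint)
        self.queue.append(request)
        return True


scheduler = ToyScheduler()
last_page = "https://quotes.toscrape.com/js/page/10/"  # illustrative URL

first = scheduler.enqueue(Request(last_page, callback="parse_urls"))
repeat = scheduler.enqueue(Request(last_page, callback="parse"))
forced = scheduler.enqueue(Request(last_page, callback="parse", dont_filter=True))

print(first, repeat, forced)  # True False True
```

The second call mirrors the else block in the question: same URL, new callback, no dont_filter, so it is dropped. The third call mirrors the fix.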

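The log excerpt in the question gives no hint that a request was dropped. Scrapy reports filtered duplicates at DEBUG level, and by default only the first one; the DUPEFILTER_DEBUG setting makes it log every one. A possible settings fragment for diagnosing this, assuming the spider's log level was set above DEBUG (which the excerpt suggests but does not confirm):

```python
# Sketch of settings that surface dropped duplicate requests.
# The name CUSTOM_SETTINGS is illustrative; in a spider this dict would be
# assigned to the `custom_settings` class attribute, or the keys placed
# in the project's settings.py.
CUSTOM_SETTINGS = {
    "LOG_LEVEL": "DEBUG",      # duplicate-filter messages are logged at DEBUG level
    "DUPEFILTER_DEBUG": True,  # log every filtered duplicate, not just the first
}
```

With these in place, a "Filtered duplicate request" line should appear for the request yielded in the else block when dont_filter is not set.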
