Scrapy callback not executed when using Playwright for JavaScript rendering
Question
I'm using Scrapy with the Playwright plugin to crawl a website that relies on JavaScript for rendering.
My spider includes two asynchronous functions, parse_categories and parse_product_page.
The parse_categories function looks for category links on the page and keeps yielding requests back to itself until it reaches a page with no categories, which should be a product page.
If no categories are found, it should send a request to the parse_product_page callback.
However, when it reaches the else block in parse_categories, it seems that the request to parse_product_page is never made. I've confirmed that the code enters the else block, but the print statement in the parse_product_page function is never reached.
Here is my reprex (parse_urls and parse stand in for parse_categories and parse_product_page):
import scrapy
from scrapy_playwright.page import PageMethod


class Spider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ['quotes.toscrape.com']

    def start_requests(self):
        # Render the first page with Playwright and wait for the pager links.
        yield scrapy.Request(url='https://quotes.toscrape.com/js/', callback=self.parse_urls,
            meta=dict(
                playwright = True,
                playwright_include_page = True,
                playwright_page_methods = [
                    PageMethod('wait_for_selector','body > div > nav > ul > li > a')
                ],
            ))

    async def parse_urls(self, response):
        # Close the Playwright page as soon as the rendered response is available.
        page = response.meta['playwright_page']
        await page.close()
        next_page_url = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page_url:
            print("Inside if block")
            url = 'https://quotes.toscrape.com' + next_page_url
            yield scrapy.Request(url=url,callback=self.parse_urls,
                meta=dict(
                    playwright = True,
                    playwright_include_page = True,
                    playwright_page_methods = [
                        PageMethod('wait_for_selector','body > div > div.quote')]
                ))
        else:
            print("Next page link not found")
            # Request the current URL again, this time with the parse callback.
            yield scrapy.Request(url=response.request.url, callback=self.parse,
                meta=dict(
                    playwright = True,
                    playwright_include_page = True,
                    playwright_page_methods = [
                        PageMethod('wait_for_selector','body > div > div.quote')]
                ))

    async def parse(self,response):
        page = response.meta['playwright_page']
        await page.close()
        print("Function has been called, because next page link not found")
These are the logs from the reprex:
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Next page link not found
2023-04-11 09:47:04 [root] WARNING: spider quotes finished crawling
Answer 1
Score: 0
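This issue was fixed by adding dont_filter=True to the scrapy.Request yielded in the else block. The request in the else block targets response.request.url, a URL that has already been crawled, so Scrapy's duplicate filter drops it before it is downloaded and the callback never runs. Setting dont_filter=True tells the scheduler to skip that check for this request.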
else:
    yield scrapy.Request(url=response.request.url,
        callback=self.parse,
        dont_filter=True,
        meta=dict(
            playwright = True,
            playwright_include_page = True,
            playwright_page_methods = [
                PageMethod('wait_for_selector','body > div > div.quote')]
        ))
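To see why the original else branch went nowhere, here is a minimal sketch of duplicate filtering in plain Python. It is a toy model, not Scrapy's actual code: SimpleDupeFilter, schedule, and the example last-page URL are hypothetical names made up for illustration, and the fingerprint here is just a hash of the HTTP method and URL. What it is meant to show is that the callback and meta play no part in deciding whether a request is a duplicate, so re-requesting the same URL with a different callback is still dropped unless dont_filter is set.

```python
import hashlib


class SimpleDupeFilter:
    """Toy model of a request duplicate filter (not Scrapy's real class)."""

    def __init__(self):
        self.seen = set()

    def fingerprint(self, method, url):
        # Assumption for this sketch: the fingerprint depends only on
        # the HTTP method and the URL, never on the callback or meta.
        return hashlib.sha1(f"{method} {url}".encode()).hexdigest()

    def request_seen(self, method, url):
        fp = self.fingerprint(method, url)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False


def schedule(dupefilter, url, callback_name, dont_filter=False, method="GET"):
    """Return the callback name that would run, or None if the request is dropped."""
    if not dont_filter and dupefilter.request_seen(method, url):
        return None
    return callback_name


df = SimpleDupeFilter()
last_page = "https://quotes.toscrape.com/js/page/10/"  # hypothetical last page

first = schedule(df, last_page, "parse_urls")               # first visit is accepted
dropped = schedule(df, last_page, "parse")                  # same URL, new callback
forced = schedule(df, last_page, "parse", dont_filter=True) # filter bypassed

print(first, dropped, forced)
```

In a real crawl, a dropped request of this kind can be confirmed by enabling DUPEFILTER_DEBUG = True in the Scrapy settings, which logs every filtered duplicate, or by checking the dupefilter/filtered counter in the stats printed when the spider closes.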