Scrapy response 403 set request.dont_filter False
Question
I'm currently scraping https://www.carsales.com.au/cars/results.
The site uses a cookie ('datadome') that expires after some time; once it does, every response comes back as a 403 until the crawl stops. I'm currently using JOBDIR in settings.py to persist data between crawls (sketched below).
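For reference, a minimal sketch of that persistence setting, assuming the standard Scrapy JOBDIR option; the directory name is an arbitrary example:

```python
# settings.py -- persists scheduler state and the duplicate-request
# filter to disk so they survive between crawls (directory is arbitrary)
JOBDIR = "crawls/carsales"
```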
Once I update the cookie and start the crawler again, the pages that returned 403 are skipped, because those requests were already made against the site and are filtered out as duplicates.
Is there a way to set dont_filter once I get the response?
I've tried the following in a downloader middleware, with no luck:
```python
def process_response(self, request, response, spider):
    # if response.status == 403:
    #     print(request.url, "expired cookie")
    #     request.dont_filter = True
    return response
```
Manipulating the dupefilter's set of seen requests seems like an option too, but I can't find any hint on how to use it.
Thanks in advance.
Answer 1
Score: 0
I'm not sure I understand your use case, but to answer your question: you can reschedule a request in a downloader middleware. Make sure its priority is high in your settings, and in process_response return a new, modified request:
```python
def process_response(self, request, response, spider):
    if response.status == 403:
        print(request.url, "expired cookie")
        request.dont_filter = True
        return request
    return response
```
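A sketch of how such a middleware might be registered with a high order value in settings.py; the module path and class name are assumptions, not taken from the question:

```python
# settings.py -- module path and class name below are hypothetical.
# Middlewares with a higher order value sit closer to the downloader,
# so their process_response runs before the built-in middlewares'.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ExpiredCookieRetryMiddleware": 900,
}
```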
As per the documentation, if process_response returns a Request it will be rescheduled; if you return a Response instead, it continues through the middleware chain and is passed to your callback.
> If it returns a Response (it could be the same given response, or a brand-new one), that response will continue to be processed with the process_response() of the next middleware in the chain.
>
> If it returns a Request object, the middleware chain is halted and the returned request is rescheduled to be downloaded in the future. This is the same behavior as if a request is returned from process_request().
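Since the rescheduled request would still carry the stale cookie, one might also attach a fresh value before returning it. A hedged sketch, where load_fresh_datadome_cookie is a hypothetical helper that supplies an up-to-date cookie value:

```python
def process_response(self, request, response, spider):
    if response.status == 403:
        # Request.replace() returns a copy of the request;
        # dont_filter=True lets it past the duplicate-request filter.
        retry = request.replace(dont_filter=True)
        # Hypothetical helper: fetch a refreshed 'datadome' value
        # obtained outside this middleware.
        retry.cookies["datadome"] = load_fresh_datadome_cookie()
        return retry
    return response
```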