Scrapy response 403: set request.dont_filter to False

Question

I'm currently scraping https://www.carsales.com.au/cars/results.

This site uses a cookie ('datadome') that expires after some time; once it does, every request gets a 403 response until the crawl stops. I'm currently using JOBDIR in settings.py for persistent data between crawls.
Once I update the cookie and start the crawler again, the pages that returned 403 are skipped, because the requests are treated as duplicates of requests already made to the site.
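
For reference, the persistence setup is just the job directory setting in settings.py (the path below is only an example):

# settings.py -- persist the scheduler and duplicate-filter state between runs
JOBDIR = 'crawls/carsales-1'  # example path; any writable directory will do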

Is there a way to set dont_filter once I get the response?

I've tried the following in a downloader middleware, with no luck:

def process_response(self, request, response, spider):
    # if response.status == 403:
    #     print(request.url, "expired cookie")
    #     request.dont_filter = True
    return response

Manipulating the record of seen request URLs seems like an option too, but I can't find any hint on how to do that.

Thanks in advance.

Answer 1

Score: 0

I'm not sure I understand your use case, but to answer your question: you can reschedule a request in a downloader middleware. Make sure its priority is high in your settings, and in process_response return the modified request:

def process_response(self, request, response, spider):
    if response.status == 403:
        print(request.url, "expired cookie")
        request.dont_filter = True
        return request
    return response

As per the documentation, if process_response returns a request it will be rescheduled; if you return a response, it will continue to be processed through the remaining middlewares and be returned to your callback.

https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#scrapy.downloadermiddlewares.DownloaderMiddleware.process_response

> If it returns a Response (it could be the same given response, or a brand-new one), that response will continue to be processed with the process_response() of the next middleware in the chain.
>
> If it returns a Request object, the middleware chain is halted and the returned request is rescheduled to be downloaded in the future. This is the same behavior as if a request is returned from process_request().
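
For reference, a minimal sketch of how this could be wired up; the project path, class name, and order value below are placeholders, and the print call is swapped for the spider's logger:

# settings.py -- enable the middleware (module path and order value are examples)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ExpiredCookieRetryMiddleware': 900,
}

# middlewares.py -- the process_response method above wrapped in a class
class ExpiredCookieRetryMiddleware:
    def process_response(self, request, response, spider):
        if response.status == 403:
            spider.logger.info("expired cookie: %s", request.url)
            request.dont_filter = True  # skip the duplicate filter when retried
            return request              # halts the chain; the request is rescheduled
        return response

A larger order number places the middleware closer to the downloader, so its process_response sees the 403 before most of the other middlewares do.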
