Scrapy response 403 set request.dont_filter False
Question
I'm currently scraping https://www.carsales.com.au/cars/results.
The site uses a cookie ('datadome') that expires after some time; once it does, every response comes back as a 403 until the crawl stops. I'm currently using JOBDIR in settings.py to persist data between crawls (sketched below).
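For reference, a minimal sketch of that persistence setting, assuming the standard Scrapy JOBDIR option; the directory name is an arbitrary example:

```python
# settings.py -- persists scheduler state and the duplicate-request
# filter to disk so they survive between crawls (directory is arbitrary)
JOBDIR = "crawls/carsales"
```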
Once I update the cookie and start the crawler again, the pages that returned 403 are skipped, because those requests were already made against the site and are filtered out as duplicates.
Is there a way to set dont_filter once I get the response?
I've tried the following in a downloader middleware, with no luck:
```python
def process_response(self, request, response, spider):
    # if response.status == 403:
    #     print(request.url, "expired cookie")
    #     request.dont_filter = True
    return response
```
Manipulating the dupefilter's set of seen requests seems like an option too, but I can't find any hint on how to use it.
Thanks in advance.
Answer 1
Score: 0
I'm not sure I understand your use case, but to answer your question: you can reschedule a request in a downloader middleware. Make sure its priority is high in your settings, and in process_response return a new, modified request:
```python
def process_response(self, request, response, spider):
    if response.status == 403:
        print(request.url, "expired cookie")
        request.dont_filter = True
        return request
    return response
```
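A sketch of how such a middleware might be registered with a high order value in settings.py; the module path and class name are assumptions, not taken from the question:

```python
# settings.py -- module path and class name below are hypothetical.
# Middlewares with a higher order value sit closer to the downloader,
# so their process_response runs before the built-in middlewares'.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ExpiredCookieRetryMiddleware": 900,
}
```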
As per the documentation, if process_response returns a Request it will be rescheduled; if you return a Response instead, it continues through the middleware chain and is passed to your callback.
> If it returns a Response (it could be the same given response, or a brand-new one), that response will continue to be processed with the process_response() of the next middleware in the chain.
>
> If it returns a Request object, the middleware chain is halted and the returned request is rescheduled to be downloaded in the future. This is the same behavior as if a request is returned from process_request().
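Since the rescheduled request would still carry the stale cookie, one might also attach a fresh value before returning it. A hedged sketch, where load_fresh_datadome_cookie is a hypothetical helper that supplies an up-to-date cookie value:

```python
def process_response(self, request, response, spider):
    if response.status == 403:
        # Request.replace() returns a copy of the request;
        # dont_filter=True lets it past the duplicate-request filter.
        retry = request.replace(dont_filter=True)
        # Hypothetical helper: fetch a refreshed 'datadome' value
        # obtained outside this middleware.
        retry.cookies["datadome"] = load_fresh_datadome_cookie()
        return retry
    return response
```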