Scrapy: CloseSpider in start_requests doesn't work
Question
I am trying to use CloseSpider in def start_requests(self); however, it seems that my command is ignored.
Here is my console output. After scrapy.exceptions.CloseSpider is raised, my spider seems to continue crawling.
WARNING:root:Scraper in DEBUG mode
i ======= 1
i ======= 2
i ======= 3
i ======= 4
i ======= 5
i ======= 6
i ======= 7
i ======= 8
i ======= 9
i ======= 10
WARNING:root:############ Reached 10. ############
2020-01-03 09:52:37 [root] WARNING: ############ Reached 10. ############
Error while obtaining start requests
Traceback (most recent call last):
File "/Users/Marc/.local/share/virtualenvs/scrapy-Qon0LmmU/lib/python3.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "/Users/Marc/Desktop/Dev/Scraper/scrapy/spider/spider/spiders/spider.py", line 66, in start_requests
raise CloseSpider("############ Reached 10. ############")
scrapy.exceptions.CloseSpider
Dropped: Missing website for: Dimensions Festival 2012
{'pk': 7, 'name': 'Dimensions Festival 2012', 'website': ''}
Dropped: Missing website for: Hideout 2012
{'pk': 5, 'name': 'Hideout 2012', 'website': ''}
Dropped: Missing website for: Beacons Festival 2012
{'pk': 6, 'name': 'Beacons Festival 2012', 'website': ''}
spider.py
def start_requests(self):
    for i in range(1, 100):
        print("i ======= ", i)
        if i == 10:
            logging.warning("############ Reached 10. ############")
            raise CloseSpider("############ Reached 10. ############")
        yield scrapy.Request("https://www.somewebsite.com/api-internal/v1/events/%s/?format=json" % i)
Answer 1
Score: 3
Unfortunately, start_requests has unique functionality and is not treated like a callback. The easiest hack for this is to ignore it with a temporary URL:
def start_requests(self):
    # start crawl with local file
    yield Request('file:///tmp/file.html')

def parse(self, response):
    # add your start requests logic here
    raise CloseSpider('works here!')
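The throwaway URL only exists to get the engine past start_requests so that parse() runs. CloseSpider raised from a regular callback such as parse() is handled by the engine and shuts the spider down, whereas, as the traceback above shows, an exception raised inside start_requests is merely reported as "Error while obtaining start requests" and the requests that were already scheduled keep being processed.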
Alternatively, depending on your use case, you can look into cleaner solutions like Scrapy Signals[1].

1 - https://docs.scrapy.org/en/latest/topics/signals.html
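For the specific goal of stopping after a fixed number of requests, a minimal sketch of the signals route could look like the following. The spider name, the response counter, and the call to crawler.engine.close_spider() are my own assumptions for illustration, not something taken from the answer above:

import scrapy
from scrapy import signals


class EventsSpider(scrapy.Spider):
    # Hypothetical spider name, for illustration only.
    name = "events"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Call our handler for every response the engine receives.
        crawler.signals.connect(spider.response_received, signal=signals.response_received)
        spider.seen = 0
        return spider

    def start_requests(self):
        for i in range(1, 100):
            yield scrapy.Request("https://www.somewebsite.com/api-internal/v1/events/%s/?format=json" % i)

    def response_received(self, response, request, spider):
        # Count responses and ask the engine to stop once we have seen 10.
        # Note: engine.close_spider() is not documented public API; using it
        # here is an assumption about how to stop a crawl outside a callback.
        self.seen += 1
        if self.seen >= 10:
            self.crawler.engine.close_spider(self, reason="reached 10 responses")

    def parse(self, response):
        self.logger.info("got %s", response.url)

Scrapy also ships a built-in CloseSpider extension whose CLOSESPIDER_PAGECOUNT setting stops the crawl after a given number of responses, which may be simpler than wiring up signals by hand.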