Scrapy: CloseSpider in start_requests doesn't work

Question

I am trying to raise CloseSpider in def start_requests(self); however, the exception seems to be ignored.

Here is my console output. After scrapy.exceptions.CloseSpider is raised, my spider seems to continue crawling.

  WARNING:root:Scraper in DEBUG mode
  i ======= 1
  i ======= 2
  i ======= 3
  i ======= 4
  i ======= 5
  i ======= 6
  i ======= 7
  i ======= 8
  i ======= 9
  i ======= 10
  WARNING:root:############ Reached 10. ############
  2020-01-03 09:52:37 [root] WARNING: ############ Reached 10. ############
  Error while obtaining start requests
  Traceback (most recent call last):
    File "/Users/Marc/.local/share/virtualenvs/scrapy-Qon0LmmU/lib/python3.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
      request = next(slot.start_requests)
    File "/Users/Marc/Desktop/Dev/Scraper/scrapy/spider/spider/spiders/spider.py", line 66, in start_requests
      raise CloseSpider("############ Reached 10. ############")
  scrapy.exceptions.CloseSpider
  Dropped: Missing website for: Dimensions Festival 2012
  {'pk': 7, 'name': 'Dimensions Festival 2012', 'website': ''}
  Dropped: Missing website for: Hideout 2012
  {'pk': 5, 'name': 'Hideout 2012', 'website': ''}
  Dropped: Missing website for: Beacons Festival 2012
  {'pk': 6, 'name': 'Beacons Festival 2012', 'website': ''}

spider.py

  def start_requests(self):
      for i in range(1, 100):
          print("i ======= ", i)
          if i == 10:
              logging.warning("############ Reached 10. ############")
              raise CloseSpider("############ Reached 10. ############")
          yield scrapy.Request("https://www.somewebsite.com/api-internal/v1/events/%s/?format=json" % i)
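
(Per the traceback, the raise sits at line 66 of spider.py inside the spider class; the fragment assumes import logging, import scrapy, and from scrapy.exceptions import CloseSpider at module level.)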

Answer 1

Score: 3

Unfortunately, start_requests has unique functionality and is not treated like a callback, so CloseSpider raised there is not honoured. The easiest hack is to bypass it with a temporary URL:

  def start_requests(self):
      # start the crawl with a local file
      yield Request('file:///tmp/file.html')

  def parse(self, response):
      # add your start requests logic here
      raise CloseSpider('works here!')
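
This works because the engine pulls start requests lazily (the traceback shows request = next(slot.start_requests)); an exception raised there is logged as "Error while obtaining start requests" rather than handled as a close request, and anything already scheduled keeps being processed. In parse, by contrast, CloseSpider is raised from an ordinary callback, where the engine handles it and shuts the spider down.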

Alternatively, depending on your use case, you can look into cleaner solutions such as Scrapy signals[1].

1 - https://docs.scrapy.org/en/latest/topics/signals.html
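
As a rough illustration of the signal-based route (a minimal sketch, not from the original answer; the spider name, handler name, and start URL are made up), you can connect a handler in from_crawler and keep any shutdown logic in a regular callback:

  import scrapy
  from scrapy import signals
  from scrapy.exceptions import CloseSpider

  class EventsSpider(scrapy.Spider):
      # Hypothetical spider, only to show how a signal handler is wired up.
      name = 'events'
      start_urls = ['https://www.somewebsite.com/api-internal/v1/events/1/?format=json']

      @classmethod
      def from_crawler(cls, crawler, *args, **kwargs):
          spider = super().from_crawler(crawler, *args, **kwargs)
          # spider_idle fires when the scheduler runs out of requests; a handler
          # may raise scrapy.exceptions.DontCloseSpider to keep the spider alive,
          # or simply return to let it close normally.
          crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
          return spider

      def on_idle(self, spider):
          spider.logger.info('Scheduler empty, letting the spider close.')

      def parse(self, response):
          # CloseSpider is honoured here because parse is a normal callback.
          raise CloseSpider('works in a callback')

Another pattern you will often see is calling self.crawler.engine.close_spider(self, reason) from a callback, although the engine object is not part of Scrapy's documented public API.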
