Scrapy: CloseSpider in start_requests doesn't work

Question

I am trying to use CloseSpider in def start_requests(self); however, it seems that my command is ignored.

Here is my console output. After scrapy.exceptions.CloseSpider is raised, my spider seems to continue crawling.

WARNING:root:Scraper in DEBUG mode
i =======  1
i =======  2
i =======  3
i =======  4
i =======  5
i =======  6
i =======  7
i =======  8
i =======  9
i =======  10
WARNING:root:############ Reached 10. ############
2020-01-03 09:52:37 [root] WARNING: ############ Reached 10. ############
Error while obtaining start requests
Traceback (most recent call last):
  File "/Users/Marc/.local/share/virtualenvs/scrapy-Qon0LmmU/lib/python3.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "/Users/Marc/Desktop/Dev/Scraper/scrapy/spider/spider/spiders/spider.py", line 66, in start_requests
    raise CloseSpider("############ Reached 10. ############")
scrapy.exceptions.CloseSpider
Dropped: Missing website for: Dimensions Festival 2012
{'pk': 7, 'name': 'Dimensions Festival 2012', 'website': ''}
Dropped: Missing website for: Hideout 2012
{'pk': 5, 'name': 'Hideout 2012', 'website': ''}
Dropped: Missing website for: Beacons Festival 2012
{'pk': 6, 'name': 'Beacons Festival 2012', 'website': ''}

spider.py

def start_requests(self):
    for i in range(1, 100):
        print("i ======= ", i)
        if i == 10:
            logging.warning("############ Reached 10. ############")
            raise CloseSpider("############ Reached 10. ############")
        yield scrapy.Request("https://www.somewebsite.com/api-internal/v1/events/%s/?format=json" % i)

Answer 1

Score: 3

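Unfortunately, start_requests is handled differently from callbacks. As the traceback in the question shows, the engine pulls from it with next() and only logs the exception as "Error while obtaining start requests", so CloseSpider raised there does not shut the spider down. The easiest hack is to bypass start_requests with a temporary URL and move the logic into a callback: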

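from scrapy import Request
from scrapy.exceptions import CloseSpider

# Both methods belong to your scrapy.Spider subclass.
def start_requests(self):
    # Start the crawl with a placeholder local file (it must exist so that parse is called).
    yield Request('file:///tmp/file.html')

def parse(self, response):
    # Add your start-requests logic here; CloseSpider is honored inside callbacks.
    raise CloseSpider('works here!')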

Alternatively, depending on your use case, you can look into cleaner solutions such as Scrapy signals [1].

1 - https://docs.scrapy.org/en/latest/topics/signals.html
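
To make the behavior described above concrete, here is a minimal pure-Python sketch that does not use Scrapy at all. It is a simplified model, not Scrapy's actual engine code: the CloseSpider class, run_engine function, and "request-N" strings are hypothetical stand-ins. The only parts taken from the question are the next() call on the start-requests iterator and the "Error while obtaining start requests" log message seen in the traceback.

```python
import logging


class CloseSpider(Exception):
    """Hypothetical stand-in for scrapy.exceptions.CloseSpider."""


def start_requests():
    """Mimics the question's generator: yields nine 'requests', then raises."""
    for i in range(1, 100):
        if i == 10:
            raise CloseSpider("Reached 10.")
        yield f"request-{i}"


def run_engine(start_iter):
    """Toy engine loop (an assumption, not Scrapy's real implementation).

    It pulls start requests one at a time with next(), the same call
    that appears in the question's traceback.
    """
    scheduled = []
    while start_iter is not None:
        try:
            scheduled.append(next(start_iter))
        except StopIteration:
            start_iter = None
        except Exception:
            # The error is only logged; nothing here asks the spider to close.
            logging.error("Error while obtaining start requests", exc_info=True)
            start_iter = None
    # Requests scheduled before the error are still processed afterwards.
    return [f"response for {request}" for request in scheduled]


processed = run_engine(start_requests())
print(len(processed))  # 9
```

In this model the nine requests yielded before the exception are still handled, which matches the "Dropped: ..." items that appear after the traceback in the question's output. Inside a callback such as parse, by contrast, Scrapy does treat CloseSpider as a request to stop the crawl, which is why the workaround above moves the check there.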

Posted by huangapple on 2020-01-03 16:57:03