Scrapy: CloseSpider in start_requests doesn't work

Question

I am trying to raise CloseSpider in def start_requests(self); however, the exception seems to be ignored.

Here is my console output. After scrapy.exceptions.CloseSpider is raised, my spider seems to continue crawling.

  WARNING:root:Scraper in DEBUG mode
  i ======= 1
  i ======= 2
  i ======= 3
  i ======= 4
  i ======= 5
  i ======= 6
  i ======= 7
  i ======= 8
  i ======= 9
  i ======= 10
  WARNING:root:############ Reached 10. ############
  2020-01-03 09:52:37 [root] WARNING: ############ Reached 10. ############
  Error while obtaining start requests
  Traceback (most recent call last):
    File "/Users/Marc/.local/share/virtualenvs/scrapy-Qon0LmmU/lib/python3.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
      request = next(slot.start_requests)
    File "/Users/Marc/Desktop/Dev/Scraper/scrapy/spider/spider/spiders/spider.py", line 66, in start_requests
      raise CloseSpider("############ Reached 10. ############")
  scrapy.exceptions.CloseSpider
  Dropped: Missing website for: Dimensions Festival 2012
  {'pk': 7, 'name': 'Dimensions Festival 2012', 'website': ''}
  Dropped: Missing website for: Hideout 2012
  {'pk': 5, 'name': 'Hideout 2012', 'website': ''}
  Dropped: Missing website for: Beacons Festival 2012
  {'pk': 6, 'name': 'Beacons Festival 2012', 'website': ''}

spider.py

  def start_requests(self):
      for i in range(1, 100):
          print("i ======= ", i)
          if i == 10:
              logging.warning("############ Reached 10. ############")
              raise CloseSpider("############ Reached 10. ############")
          yield scrapy.Request("https://www.somewebsite.com/api-internal/v1/events/%s/?format=json" % i)
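
(Per the traceback, the raise sits at line 66 of spider.py inside the spider class; the fragment assumes import logging, import scrapy, and from scrapy.exceptions import CloseSpider at module level.)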

Answer 1

Score: 3

Unfortunately, start_requests has unique functionality and is not treated like a callback, so CloseSpider raised there is not honoured. The easiest hack is to bypass it with a temporary URL:

  def start_requests(self):
      # start the crawl with a local file
      yield Request('file:///tmp/file.html')

  def parse(self, response):
      # add your start requests logic here
      raise CloseSpider('works here!')
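
This works because the engine pulls start requests lazily (the traceback shows request = next(slot.start_requests)); an exception raised there is logged as "Error while obtaining start requests" rather than handled as a close request, and anything already scheduled keeps being processed. In parse, by contrast, CloseSpider is raised from an ordinary callback, where the engine handles it and shuts the spider down.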

Alternatively, depending on your use case, you can look into cleaner solutions such as Scrapy signals[1].

1 - https://docs.scrapy.org/en/latest/topics/signals.html
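
As a rough illustration of the signal-based route (a minimal sketch, not from the original answer; the spider name, handler name, and start URL are made up), you can connect a handler in from_crawler and keep any shutdown logic in a regular callback:

  import scrapy
  from scrapy import signals
  from scrapy.exceptions import CloseSpider

  class EventsSpider(scrapy.Spider):
      # Hypothetical spider, only to show how a signal handler is wired up.
      name = 'events'
      start_urls = ['https://www.somewebsite.com/api-internal/v1/events/1/?format=json']

      @classmethod
      def from_crawler(cls, crawler, *args, **kwargs):
          spider = super().from_crawler(crawler, *args, **kwargs)
          # spider_idle fires when the scheduler runs out of requests; a handler
          # may raise scrapy.exceptions.DontCloseSpider to keep the spider alive,
          # or simply return to let it close normally.
          crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
          return spider

      def on_idle(self, spider):
          spider.logger.info('Scheduler empty, letting the spider close.')

      def parse(self, response):
          # CloseSpider is honoured here because parse is a normal callback.
          raise CloseSpider('works in a callback')

Another pattern you will often see is calling self.crawler.engine.close_spider(self, reason) from a callback, although the engine object is not part of Scrapy's documented public API.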
