Scrapy: run thousands of instances of the same spider


Question

I have the following task: the database contains roughly 2,000 URLs, and for each URL we need to run a spider until all of them have been processed. So far I have been running the spider on a batch of URLs (10 per run).

I used the following code:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    URLs = crawler_table.find(crawl_timestamp=None)
    settings = get_project_settings()
    for i in range(len(URLs) // 10):
        process = CrawlerProcess(settings)
        limit = 10
        kount = 0
        for crawl in crawler_table.find(crawl_timestamp=None):
            if kount < limit:
                kount += 1
                process.crawl(
                    MySpider,
                    start_urls=[crawl['crawl_url']]
                )
        process = CrawlerProcess(settings)
        process.start()

But it only runs the first loop iteration; on the second one I get this error:

    File "C:\Program Files\Python310\lib\site-packages\scrapy\crawler.py", line 327, in start
        reactor.run(installSignalHandlers=False)  # blocking call
    File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1314, in run
        self.startRunning(installSignalHandlers=installSignalHandlers)
    File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1296, in startRunning
        ReactorBase.startRunning(cast(ReactorBase, self))
    File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 840, in startRunning
        raise error.ReactorNotRestartable()
    twisted.internet.error.ReactorNotRestartable

Is there a way to avoid this error and run the spider over all 2,000 URLs?


Answer 1

Score: 2

This happens because the Twisted reactor cannot be started twice in the same process. You can use multiprocessing and launch each batch in a separate process. Your code might look like this:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    import multiprocessing as mp

    def start_crawlers(urls_batchs, limit=10):
        # Runs in a child process, so the Twisted reactor is started
        # (and torn down) at most once per process.
        settings = get_project_settings()
        process = CrawlerProcess(settings)
        kount = 0
        for batch in urls_batchs:
            if kount < limit:
                kount += 1
                process.crawl(
                    MySpider,
                    start_urls=[batch]
                )
        process.start()  # blocking call; returns when the whole batch is done

    if __name__ == "__main__":
        URLs = ...
        for urls_batchs in URLs:
            process = mp.Process(target=start_crawlers, args=(urls_batchs,))
            process.start()
            process.join()
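
As a rough sketch of how the URLs = ... placeholder above could be filled in, assuming the same crawler_table interface from the question (the make_batches helper is hypothetical, not part of the answer):

    def make_batches(rows, batch_size=10):
        # Hypothetical helper: group pending crawl URLs into lists of batch_size.
        batch = []
        for row in rows:
            batch.append(row['crawl_url'])
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # yield the final, possibly smaller, batch
            yield batch

    # Assumed usage with the question's table:
    # URLs = make_batches(crawler_table.find(crawl_timestamp=None))

Each batch is then handed to start_crawlers in its own child process, so no process ever has to restart the reactor.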

Posted by huangapple on 2023-03-09 17:01:14. Original link: https://go.coder-hub.com/75682376.html