Scrapy: run thousands of instances of the same spider


Question

I have the following task: the database contains roughly 2,000 URLs, and for each URL we need to run a spider until all of them have been processed. So far I have been running the spider on a batch of URLs (10 per run).

I used the following code:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    URLs = crawler_table.find(crawl_timestamp=None)
    settings = get_project_settings()
    for i in range(len(URLs) // 10):
        process = CrawlerProcess(settings)
        limit = 10
        kount = 0
        for crawl in crawler_table.find(crawl_timestamp=None):
            if kount < limit:
                kount += 1
                process.crawl(
                    MySpider,
                    start_urls=[crawl['crawl_url']]
                )
        process = CrawlerProcess(settings)
        process.start()

But it only runs the first loop iteration; on the second one I get this error:

    File "C:\Program Files\Python310\lib\site-packages\scrapy\crawler.py", line 327, in start
        reactor.run(installSignalHandlers=False)  # blocking call
    File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1314, in run
        self.startRunning(installSignalHandlers=installSignalHandlers)
    File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1296, in startRunning
        ReactorBase.startRunning(cast(ReactorBase, self))
    File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 840, in startRunning
        raise error.ReactorNotRestartable()
    twisted.internet.error.ReactorNotRestartable

Is there a way to avoid this error and run the spider over all 2,000 URLs?


Answer 1

Score: 2

This happens because the Twisted reactor cannot be started twice in the same process. You can use multiprocessing and launch each batch in a separate process. Your code might look like this:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    import multiprocessing as mp

    def start_crawlers(urls_batchs, limit=10):
        # Runs in a child process, so the Twisted reactor is started
        # (and torn down) at most once per process.
        settings = get_project_settings()
        process = CrawlerProcess(settings)
        kount = 0
        for batch in urls_batchs:
            if kount < limit:
                kount += 1
                process.crawl(
                    MySpider,
                    start_urls=[batch]
                )
        process.start()  # blocking call; returns when the whole batch is done

    if __name__ == "__main__":
        URLs = ...
        for urls_batchs in URLs:
            process = mp.Process(target=start_crawlers, args=(urls_batchs,))
            process.start()
            process.join()
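
As a rough sketch of how the URLs = ... placeholder above could be filled in, assuming the same crawler_table interface from the question (the make_batches helper is hypothetical, not part of the answer):

    def make_batches(rows, batch_size=10):
        # Hypothetical helper: group pending crawl URLs into lists of batch_size.
        batch = []
        for row in rows:
            batch.append(row['crawl_url'])
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # yield the final, possibly smaller, batch
            yield batch

    # Assumed usage with the question's table:
    # URLs = make_batches(crawler_table.find(crawl_timestamp=None))

Each batch is then handed to start_crawlers in its own child process, so no process ever has to restart the reactor.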

Posted by huangapple on 2023-03-09 17:01:14. Original link: https://go.coder-hub.com/75682376.html