Scrapy: running thousands of instances of the same spider

Question

I have the following task: the database contains roughly 2,000 URLs, and the spider needs to be run for each of them until every URL has been processed. So far I have been running the spider on batches of URLs (10 per run).

I used the following code:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

URLs = crawler_table.find(crawl_timestamp=None)
settings = get_project_settings()
for i in range(len(URLs) // 10):
    process = CrawlerProcess(settings)

    limit = 10
    kount = 0

    for crawl in crawler_table.find(crawl_timestamp=None):
        if kount < limit:
            kount += 1
            process.crawl(
                MySpider,
                start_urls=[crawl['crawl_url']]
            )
    process = CrawlerProcess(settings)
    process.start()

However, it only runs the first loop iteration. On the second iteration I get the following error:

  File "C:\Program Files\Python310\lib\site-packages\scrapy\crawler.py", line 327, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1314, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1296, in startRunning
    ReactorBase.startRunning(cast(ReactorBase, self))
  File "C:\Program Files\Python310\lib\site-packages\twisted\internet/base.py", line 840, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Is there a way to avoid this error and run the spider for all 2,000 URLs?

Answer 1

Score: 2

This happens because the Twisted reactor cannot be started twice within the same process. You can use multiprocessing and launch each batch in a separate process. Your code might look like this:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import multiprocessing as mp

# Import your spider here, e.g.:
# from myproject.spiders import MySpider


def start_crawlers(url_batch):
    """Run one CrawlerProcess (one Twisted reactor) for a single batch of URLs."""
    settings = get_project_settings()
    process = CrawlerProcess(settings)

    # Queue one crawl per URL in this batch; nothing runs until start() is called.
    for url in url_batch:
        process.crawl(MySpider, start_urls=[url])

    # Blocks until every queued crawl in this child process has finished.
    process.start()


if __name__ == "__main__":
    URLs = ...  # e.g. [row['crawl_url'] for row in crawler_table.find(crawl_timestamp=None)]
    batch_size = 10

    # Run each batch in its own child process, so the reactor is started exactly
    # once per process and ReactorNotRestartable is never raised.
    for i in range(0, len(URLs), batch_size):
        batch = URLs[i:i + batch_size]
        worker = mp.Process(target=start_crawlers, args=(batch,))
        worker.start()
        worker.join()
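
For completeness, here is a minimal sketch of an alternative that avoids restarting the reactor altogether: a single CrawlerProcess can queue many crawls before start() is called, so you can register one crawl per pending URL and start the reactor exactly once. It reuses MySpider and crawler_table from the question; whether queuing all ~2,000 crawls in one process is practical depends on machine resources and your concurrency settings, so treat it as a starting point rather than a drop-in replacement.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
process = CrawlerProcess(settings)

# Queue one crawl per pending URL; they all share the single reactor.
for crawl in crawler_table.find(crawl_timestamp=None):
    process.crawl(MySpider, start_urls=[crawl['crawl_url']])

# Start the reactor once; it stops after all queued crawls have finished.
process.start()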
