Scrapy: run thousands of instances of the same spider
Question
I have the following task: in the DB we have ~2k URLs, and for each URL we need to run the spider until all URLs have been processed.
I was running the spider for a batch of URLs (10 per run) using the following code:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

URLs = crawler_table.find(crawl_timestamp=None)
settings = get_project_settings()

for i in range(len(URLs) // 10):
    process = CrawlerProcess(settings)
    limit = 10
    kount = 0
    for crawl in crawler_table.find(crawl_timestamp=None):
        if kount < limit:
            kount += 1
            process.crawl(
                MySpider,
                start_urls=[crawl['crawl_url']]
            )
    process = CrawlerProcess(settings)
    process.start()
but it only runs for the first loop iteration. On the second I get this error:
File "C:\Program Files\Python310\lib\site-packages\scrapy\crawler.py", line 327, in start
reactor.run(installSignalHandlers=False) # blocking call
File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1314, in run
self.startRunning(installSignalHandlers=installSignalHandlers)
File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1296, in startRunning
ReactorBase.startRunning(cast(ReactorBase, self))
File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 840, in startRunning
raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
Is there any way to avoid this error and run the spider for all ~2k URLs?
Answer 1
Score: 2
This is because you can't start the Twisted reactor twice in the same process. You can use multiprocessing and launch each batch in a separate process. Your code may look like this:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import multiprocessing as mp

def start_crawlers(urls_batchs, limit=10):
    # Runs in a child process, so every batch gets a fresh Twisted reactor.
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    kount = 0
    for batch in urls_batchs:
        if kount < limit:
            kount += 1
            process.crawl(
                MySpider,
                start_urls=[batch]
            )
    process.start()

if __name__ == "__main__":
    URLs = ...  # iterable of URL batches, e.g. lists of up to 10 URLs each
    for urls_batchs in URLs:
        process = mp.Process(target=start_crawlers, args=(urls_batchs,))
        process.start()
        process.join()
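For the ~2k URLs from the question, the batches can be built from the same crawler_table query and handed to start_crawlers one child process at a time. The following is only a minimal sketch that reuses mp and start_crawlers from the snippet above, and it assumes (as in the question) that crawler_table.find(crawl_timestamp=None) returns rows with a 'crawl_url' key:

# Sketch: split the pending URLs into batches of 10 and run each batch in its
# own process, so each batch gets its own Twisted reactor.
# Assumes crawler_table from the question and start_crawlers / mp defined above.
if __name__ == "__main__":
    batch_size = 10
    pending = [row['crawl_url'] for row in crawler_table.find(crawl_timestamp=None)]
    batches = [pending[i:i + batch_size] for i in range(0, len(pending), batch_size)]
    for urls_batch in batches:
        worker = mp.Process(target=start_crawlers, args=(urls_batch,))
        worker.start()
        worker.join()  # wait so only one reactor/process runs at a time

Joining each process before starting the next keeps the batches sequential; dropping the join (or using a multiprocessing.Pool) would run several batches in parallel.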
Comments