StormCrawler: The URL Database Specifications

Question
I am quite new to StormCrawler. While exploring the documentation, as well as the READMEs and additional resources, I have noticed frequent references to a "URL database" that is supposed to store information about the URLs over the course of a crawl (for example here).

However, I have not found anywhere what type of database this is, nor how to customize it or replace it with custom modules. I have been following the code and got to IOOutputController, which has some quite confusing methods; given the lack of docstrings, it is challenging to even determine which class is responsible for handling this.
I would be very grateful for any guidance!
Thank you for your time, Matyáš
Answer 1

Score: 0
The most commonly used storage for the URLs in StormCrawler is Elasticsearch; this is illustrated in the tutorials. Other backends are available as well, such as SQL or SOLR (see the list of external modules); StormCrawler is not tied to a specific database.

In most cases, people simply use an existing backend implementation, such as the Elasticsearch one.
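For context, when the Elasticsearch module is used, the "URL database" is simply an Elasticsearch index (conventionally named "status") that the status-updater bolt writes to and the spouts read from. A minimal configuration sketch might look like the following; the exact property names should be checked against the es-conf.yaml shipped with your StormCrawler version, as they have varied between releases:

```yaml
# Sketch of an es-conf.yaml fragment for the "status" index,
# which serves as the URL database. Treat the keys below as
# illustrative - verify them against your StormCrawler version.
es.status.addresses: "http://localhost:9200"
es.status.index.name: "status"
# how many URLs each spout query pulls per bucket (host/domain)
es.status.max.urls.per.bucket: 10
```

Replacing the backend entirely amounts to providing your own spout (to emit URLs due for fetching) and your own status-updater bolt (the Elasticsearch module's one extends AbstractStatusUpdaterBolt from the persistence package), rather than swapping a single "database" class.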
Comments