Why does my Apache Nutch warc and commoncrawldump fail after crawl?
Question
I have successfully crawled a website using Nutch and now I want to create a WARC file from the results. However, both the warc and commoncrawldump commands fail. Running bin/nutch dump -segment ....
on the same segment folder works successfully.
I am using Nutch v1.17 and running:
bin/nutch commoncrawldump -outputDir output/ -segment crawl/segments
The error from hadoop.log is ERROR tools.CommonCrawlDataDumper - No segment directories found in my/path/
despite having just run a crawl there.
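For context, a minimal sanity check (using the crawl/segments path from the command above, which is an assumption about the local layout): the -segment argument should point at a directory whose immediate children are timestamped segment directories.

ls crawl/segments/
# expect one directory per fetch cycle, e.g. 20200101123456/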
Answer 1
Score: 0
The segments folder contained segments from a previous crawl, and those were the ones throwing the error. They did not contain all of the segment data, presumably because that crawl was cancelled or finished early, and this caused the entire dump to fail. Deleting those incomplete segments and starting anew fixed the issue.
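A minimal sketch of how one might spot such incomplete segments before deleting them, assuming the crawl/segments layout from the question (a fully fetched and parsed Nutch 1.x segment contains the six subdirectories checked below):

for seg in crawl/segments/*/; do
  # a segment with only crawl_generate was generated but never fetched/parsed
  for sub in crawl_generate crawl_fetch crawl_parse content parse_data parse_text; do
    [ -d "$seg$sub" ] || echo "incomplete: $seg (missing $sub)"
  done
done

Any segment flagged here can be removed (or re-fetched and re-parsed) before re-running commoncrawldump.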