Why does my Apache Nutch warc and commoncrawldump fail after crawl?

Question


I have successfully crawled a website using Nutch and now I want to create a WARC file from the results. However, both the warc and commoncrawldump commands fail. Also, running bin/nutch dump -segment .... works successfully on the same segment folder.

I am using Nutch v1.17 and running:

bin/nutch commoncrawldump -outputDir output/ -segment crawl/segments

The error from hadoop.log is ERROR tools.CommonCrawlDataDumper - No segment directories found in my/path/, despite having just run a crawl there.

Answer 1

Score: 0


Inside the segments folder were segments from a previous crawl that were causing the error. They did not contain all the segment data, because (I believe) that crawl was cancelled or finished early. This caused the entire process to fail. Deleting all those files and starting anew fixed the issue.
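A segment from a Nutch 1.x crawl that completed fetching and parsing normally contains six sub-directories (crawl_generate, crawl_fetch, crawl_parse, content, parse_data, parse_text); an interrupted crawl can leave a segment with only some of them, which the dump tools then choke on. As a minimal sketch of how to spot such leftovers, using a made-up demo/segments path in place of your real crawl/segments:

```shell
# Demo layout (assumed, for illustration): one complete segment and one
# truncated by a cancelled crawl. Point the loops at your real crawl/segments.
for part in crawl_generate crawl_fetch crawl_parse content parse_data parse_text; do
  mkdir -p "demo/segments/20200915000001/$part"     # complete segment
done
mkdir -p demo/segments/20200915000002/crawl_generate # interrupted: fetch/parse never ran

# List any segment missing one of the expected sub-directories.
for seg in demo/segments/*/; do
  for part in crawl_generate crawl_fetch crawl_parse content parse_data parse_text; do
    if [ ! -d "$seg$part" ]; then
      echo "incomplete segment: $seg"
      # rm -r "$seg"   # uncomment to delete it, as the answer suggests
      break
    fi
  done
done
# prints: incomplete segment: demo/segments/20200915000002/
```

After removing (or re-crawling) the incomplete segments, the warc and commoncrawldump commands should again find only valid segment directories.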

huangapple
  • Published on 2020-09-15 17:43:51
  • Please keep the original link when republishing: https://go.coder-hub.com/63899204.html