Why does my Apache Nutch warc and commoncrawldump fail after crawl?
Question
I have successfully crawled a website using Nutch and now I want to create a WARC file from the results. However, both the warc and commoncrawldump commands fail. Running bin/nutch dump -segment ....
on the same segment folder works successfully.
I am using Nutch v1.17 and running:
bin/nutch commoncrawldump -outputDir output/ -segment crawl/segments
The error from hadoop.log is ERROR tools.CommonCrawlDataDumper - No segment directories found in my/path/
despite having just run a crawl there.
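For context, a minimal sanity check (using the crawl/segments path from the command above, which is an assumption about the local layout): the -segment argument should point at a directory whose immediate children are timestamped segment directories.

ls crawl/segments/
# expect one directory per fetch cycle, e.g. 20200101123456/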
Answer 1
Score: 0
The segments folder contained segments from a previous crawl, and those were the ones throwing the error. They did not contain all of the segment data, presumably because that crawl was cancelled or finished early, and this caused the entire dump to fail. Deleting those incomplete segments and starting anew fixed the issue.
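A minimal sketch of how one might spot such incomplete segments before deleting them, assuming the crawl/segments layout from the question (a fully fetched and parsed Nutch 1.x segment contains the six subdirectories checked below):

for seg in crawl/segments/*/; do
  # a segment with only crawl_generate was generated but never fetched/parsed
  for sub in crawl_generate crawl_fetch crawl_parse content parse_data parse_text; do
    [ -d "$seg$sub" ] || echo "incomplete: $seg (missing $sub)"
  done
done

Any segment flagged here can be removed (or re-fetched and re-parsed) before re-running commoncrawldump.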