Integrating Nutch 1.17 with Eclipse (Ubuntu 18.04)
I don't know whether the guide is outdated or I'm doing something wrong.
I just started using Nutch, and I've already integrated it with Solr and crawled/indexed some websites from the terminal.
Now I'm trying to use them in a Java application, so I've been following the tutorial here:
https://cwiki.apache.org/confluence/display/NUTCH/RunNutchInEclipse#RunNutchInEclipse-RunningNutchinEclipse
I installed Subclipse, IvyDE, and m2e through Eclipse, and I downloaded Ant, so I should have all the prerequisites.
The m2e link in the tutorial is broken, so I found it somewhere else; it also turns out Eclipse already shipped with it.
When I run 'ant eclipse' in the terminal, I get a huge list of error messages.
Due to the word-count limit, I've put a link to a pastebin with the entire error output
here
I'm really not sure what I'm doing wrong.
The directions aren't that complicated, so I really don't know where I'm messing up.
Just in case it's necessary, here is the nutch-site.xml that we needed to modify:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>plugin.folders</name>
    <value>/home/user/trunk/build/plugins</value>
  </property>
  <!-- HTTP properties -->
  <property>
    <name>http.agent.name</name>
    <value>MarketDataCrawler</value>
    <description>HTTP 'User-Agent' request header. MUST NOT be empty -
    please set this to a single word uniquely related to your organization.
    NOTE: You should also check other related properties:
      http.robots.agents
      http.agent.description
      http.agent.url
      http.agent.email
      http.agent.version
    and set their values appropriately.
    </description>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value></value>
    <description>Any other agents, apart from 'http.agent.name', that the robots
    parser would look for in robots.txt. Multiple agents can be provided using
    comma as a delimiter. eg. mybot,foo-spider,bar-crawler
    The ordering of agents does NOT matter and the robots parser would make
    decision based on the agent which matches first to the robots rules.
    Also, there is NO need to add a wildcard (ie. "*") to this string as the
    robots parser would smartly take care of a no-match situation.
    If no value is specified, by default HTTP agent (ie. 'http.agent.name')
    would be used for user agent matching by the robots parser.
    </description>
  </property>
</configuration>
A ton of the errors have to do with Ivy, so I wonder whether the Ivy version used by Nutch and the one used by the plugins installed in Eclipse are compatible.
Answer 1
Score: 0
As indicated in the log file:
[ivy:resolve] SERVER ERROR: HTTPS Required url=http://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.6.1/slf4j-api-1.6.1.pom
[ivy:resolve] SERVER ERROR: HTTPS Required url=http://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.6.1/slf4j-api-1.6.1.jar
[ivy:resolve] SERVER ERROR: HTTPS Required url=http://repo1.maven.org/maven2/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.pom
You should use updated repository URLs in ivy/ivy.xml. One option is to change each URL from http to https in ivy.xml.
I think you are using an old version; otherwise this issue would already have been fixed.
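For illustration, here is a hedged sketch of the rewrite. In stock Nutch source checkouts the Maven Central URL typically sits in ivy/ivysettings.xml as a property value (the exact file and property name may differ in your version); the snippet below demonstrates the http-to-https substitution on a copy of such a line:

```shell
# Hedged sketch: demonstrate the http -> https rewrite on a sample
# property line like the one found in Nutch's ivy/ivysettings.xml.
# The property name "repo.maven.org" is an assumption based on the
# stock Nutch layout; adjust the real file path to your checkout.
printf '<property name="repo.maven.org" value="http://repo1.maven.org/maven2/" override="false"/>\n' > /tmp/ivysettings-snippet.xml

# In-place substitution; '|' is used as the sed delimiter so the
# slashes in the URL need no escaping.
sed -i 's|http://repo1\.maven\.org|https://repo1.maven.org|g' /tmp/ivysettings-snippet.xml

cat /tmp/ivysettings-snippet.xml
```

Running the same `sed` against the real ivy/ivysettings.xml (or editing each URL by hand) and then re-running `ant eclipse` should let Ivy resolve slf4j and the other artifacts over HTTPS.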