Integrating Nutch 1.17 with Eclipse (Ubuntu 18.04)

Question


I don't know whether the guide is outdated or whether I'm doing something wrong.
I just started using Nutch, and I've integrated it with Solr and crawled/indexed some websites from the terminal.
Now I'm trying to use them in a Java application, so I've been following the tutorial here:
https://cwiki.apache.org/confluence/display/NUTCH/RunNutchInEclipse#RunNutchInEclipse-RunningNutchinEclipse

I installed Subclipse, IvyDE and m2e through Eclipse, and I downloaded Ant, so I should have all the prerequisites.
The m2e link in the tutorial is broken, so I found it somewhere else. It also turns out that Eclipse already included it on installation.

I get a huge list of error messages when I run 'ant eclipse' in the terminal.
Because of the word limit, I put the entire error message in a pastebin, linked here.

I'm really not sure what I'm doing wrong.
The directions aren't that complicated, so I really don't know where I'm messing up.

In case it's relevant, here is the nutch-site.xml that the tutorial had us modify:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
   <name>plugin.folders</name>
   <value>/home/user/trunk/build/plugins</value>
</property>

<!-- HTTP properties -->

<property>
  <name>http.agent.name</name>
  <value>MarketDataCrawler</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.robots.agents</name>
  <value></value>
  <description>Any other agents, apart from 'http.agent.name', that the robots
  parser would look for in robots.txt. Multiple agents can be provided using
  comma as a delimiter. eg. mybot,foo-spider,bar-crawler

  The ordering of agents does NOT matter and the robots parser would make
  decision based on the agent which matches first to the robots rules.
  Also, there is NO need to add a wildcard (ie. "*") to this string as the
  robots parser would smartly take care of a no-match situation.

  If no value is specified, by default HTTP agent (ie. 'http.agent.name')
  would be used for user agent matching by the robots parser.
  </description>
</property>

</configuration>

A lot of the errors have to do with Ivy, so I don't know whether the Ivy version used by Nutch is compatible with the one installed as an Eclipse plugin.

Answer 1

Score: 0


As the log file indicates:

[ivy:resolve] 	SERVER ERROR: HTTPS Required url=http://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.6.1/slf4j-api-1.6.1.pom
[ivy:resolve] 	SERVER ERROR: HTTPS Required url=http://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.6.1/slf4j-api-1.6.1.jar
[ivy:resolve] 	SERVER ERROR: HTTPS Required url=http://repo1.maven.org/maven2/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.pom

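Maven Central has required HTTPS since January 2020, so these plain http:// requests are rejected. You should update the repository URLs in the Ivy configuration under ivy/ (ivy/ivy.xml, or ivy/ivysettings.xml if that is where the resolver URLs are defined). One option is to change each URL from http to https.

I think you are using an old version; otherwise this issue should already be fixed.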
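
The http-to-https change suggested above could be scripted roughly as follows. This is only a sketch under a few assumptions: that the repository URLs live in XML files directly under the ivy/ directory of the Nutch source tree, and that repo1.maven.org (the host shown in the log) is the only host that needs rewriting. The names upgrade_to_https and patch_ivy_dir are hypothetical helpers, not part of Nutch, Ivy or Ant.

```python
import re
from pathlib import Path

# Hosts from the log that reject plain HTTP. Extend the tuple if your log lists others.
HTTPS_ONLY_HOSTS = ("repo1.maven.org",)


def upgrade_to_https(text, hosts=HTTPS_ONLY_HOSTS):
    """Rewrite http:// URLs for the given hosts to https://; other URLs are untouched."""
    host_group = "|".join(re.escape(h) for h in hosts)
    # The negative lookahead stops a match on longer names such as repo1.maven.org.example.
    pattern = re.compile(r"http://(" + host_group + r")(?![\w.-])")
    return pattern.sub(r"https://\1", text)


def patch_ivy_dir(ivy_dir):
    """Patch every *.xml file directly under ivy_dir in place; return the changed paths."""
    changed = []
    for path in sorted(Path(ivy_dir).glob("*.xml")):
        original = path.read_text(encoding="utf-8")
        updated = upgrade_to_https(original)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(str(path))
    return changed


if __name__ == "__main__":
    # Assumption: run from the root of the Nutch checkout, where the ivy/ directory lives.
    for name in patch_ivy_dir("ivy"):
        print("patched", name)
```

After patching, re-running 'ant eclipse' should show whether the "HTTPS Required" errors are gone. Back up the ivy/ directory first (or rely on version control), since the files are edited in place.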

huangapple
  • Posted on September 24, 2020 at 05:22:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64036421.html