Using Java & Apache Nutch to scrape dynamic elements from a website
Question
I want to do scraping in Java, and Apache Nutch seems to be the first choice. I have to scrape dynamic elements from a website, such as the price and mileage of vehicles. I have done the setup and tried to execute Nutch for the seed.txt URL - https://www.andersondouglas.com. But all I can see in crawl/segments is a file that just contains the URL name. I can't see or find the HTML content of the crawled webpage.
Can someone please help? How can I scrape the HTML content?
apache-nutch version 1.19
Answer 1
Score: 1
Here are the steps to fetch a URL and to export the HTML of the fetched page:

- Install Nutch and configure the agent name as described in the Nutch tutorial. Except for the agent name, all other configuration settings are the defaults. The next steps are run in an empty directory. The command `nutch` stands for `...nutch_install_path/bin/nutch`.
- Place the URL into the seed file:
  ```
  echo https://nutch.apache.org/ >seeds.txt
  ```
- Inject the seed into the CrawlDb:
  ```
  nutch inject crawldb seeds.txt
  ```
- Generate a segment:
  ```
  nutch generate crawldb/ segments/
  ```
- Fetch the generated segment (the segment name is a timestamp and needs to be adapted):
  ```
  nutch fetch segments/20230310113604/
  ```
- (Optionally) parse the segment (only required if metadata, outlinks or plain text are needed):
  ```
  nutch parse segments/20230310113604/
  ```
- Get the record of the URL (it includes the HTML but also more information):
  ```
  $> nutch readseg -get segments/20230310113604/ https://nutch.apache.org/
  ...
  Content:
  <!DOCTYPE html>
  <html lang="en-us">
  <head>
  <meta name="generator" content="Hugo 0.92.2" />
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title> Apache Nutch™ </title>
  ...
  ```
- (Alternatively) dump the segment:
  ```
  nutch readseg -dump segments/20230310113604/ segdump -recode
  ```
  The HTML text is written to `segdump/dump` and is recoded to UTF-8. Run `nutch readseg` to get help on more command-line options. A Java sketch for reading the segment records programmatically follows below.
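If you would rather read the fetched HTML from Java code instead of running `nutch readseg`, the records under a segment's `content` directory are Hadoop SequenceFiles keyed by URL (`Text`) with `org.apache.nutch.protocol.Content` values. The following is a minimal, untested sketch under that assumption; the segment path and part name (`part-00000`) are placeholders to adapt, and the Nutch and Hadoop jars must be on the classpath.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class SegmentContentReader {

    public static void main(String[] args) throws Exception {
        // "data" file inside the segment's content directory
        // (adapt the timestamp and part name to your own crawl).
        Path data = new Path("crawl/segments/20230310113604/content/part-00000/data");

        Configuration conf = new Configuration();
        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(data))) {
            Text url = new Text();
            Content content = new Content();
            // Iterate over all records fetched into this segment.
            while (reader.next(url, content)) {
                System.out.println("URL:  " + url);
                System.out.println("Type: " + content.getContentType());
                // Raw page bytes; for an HTML page this is the page source.
                // Assumes UTF-8, matching the -recode option above.
                System.out.println(new String(content.getContent(), StandardCharsets.UTF_8));
            }
        }
    }
}
```

This reads the same data that `readseg -dump` exports, just without the intermediate dump file.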
Answer 2
Score: 0
The raw content of a page (HTML, but it could also be a binary format such as PDF) is stored in the segment in the subfolder "content". Note that the content is only stored

- if the property `fetcher.store.content` is true (this is the default), and
- if fetching was successful. A trial to fetch the given URL resulted in an HTTP 403 Forbidden, so very likely the site is protected (see the sketch below for a quick way to verify this).
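To verify the 403 independently of Nutch, a quick status check with the JDK's `HttpURLConnection` is enough. This is a small sketch; the user-agent value is a placeholder, and using the same agent name you configured as `http.agent.name` in `nutch-site.xml` makes the test closer to what Nutch actually sends.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchStatusCheck {

    public static void main(String[] args) throws Exception {
        URL url = new URL("https://www.andersondouglas.com/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Placeholder agent name: replace with the value of http.agent.name
        // from your nutch-site.xml.
        conn.setRequestProperty("User-Agent", "MyNutchCrawler");
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);

        // A 403 here means the server refuses the request, so Nutch has no
        // content to store for this URL either.
        System.out.println("HTTP status: " + conn.getResponseCode());
        conn.disconnect();
    }
}
```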