Using Java & Apache Nutch to scrape dynamic elements from a website

Question

I want to do web scraping in Java, and Apache Nutch seems to be the first choice. I need to scrape dynamic elements from a website, such as the price and mileage of a vehicle. I have completed the setup and tried running Nutch with the URL https://www.andersondouglas.com in seed.txt. However, all I can see in crawl/segments is a file that contains just the URL. I cannot find the HTML content of the crawled web page.
Can someone please help? How can I get the HTML content?

Apache Nutch version 1.19

Answer 1

Score: 1

Here are the steps to fetch a URL and export the HTML of the fetched page:

  1. Install Nutch and configure the agent name as described in the Nutch tutorial. Apart from the agent name, all configuration settings are left at their defaults. The following steps are run in an empty directory. The command nutch stands for ...nutch_install_path/bin/nutch.
  2. Place the URL into the seed file: echo https://nutch.apache.org/ >seeds.txt
  3. Inject the seed into the CrawlDb: nutch inject crawldb seeds.txt
  4. Generate a segment: nutch generate crawldb/ segments/
  5. Fetch the generated segment: nutch fetch segments/20230310113604/ (the segment name is a timestamp, so adapt it to your own run)
  6. (Optional) Parse the segment: nutch parse segments/20230310113604/ (only needed if you want metadata, outlinks or plain text)
  7. Get the record of the URL (it includes the HTML along with other information):
    $> nutch readseg -get segments/20230310113604/ https://nutch.apache.org/
    ...
    Content:
    <!DOCTYPE html>
    <html lang="en-us">
    
    <head>
      <meta name="generator" content="Hugo 0.92.2" />
      <meta charset="utf-8">
      <meta http-equiv="X-UA-Compatible" content="IE=edge">
      <meta name="viewport" content="width=device-width, initial-scale=1">
      <title> Apache Nutch™ </title>
      ...
    
  8. (Alternatively) Dump the segment:
    nutch readseg -dump segments/20230310113604/ segdump -recode
    
    • The HTML text is written to segdump/dump
    • It is recoded to UTF-8
    • Run nutch readseg to see the help for more command-line options
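
Since the question asks about doing this from Java, here is a minimal sketch of how the HTML could be pulled out of the text printed by nutch readseg -get once it has been saved to a file. It assumes the record contains a line consisting only of "Content:" directly before the HTML, as in the output shown above. The class name SegmentDumpContent, the method extractContent and the default file name record.txt are hypothetical, not part of Nutch. Everything after the marker is returned, so any record sections that readseg prints after the HTML would be included too.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SegmentDumpContent {

    /**
     * Returns all text after the first line that is exactly "Content:"
     * (ignoring surrounding whitespace), or an empty string if no such
     * line exists. Assumption: this marker precedes the raw HTML.
     */
    static String extractContent(String record) {
        StringBuilder html = new StringBuilder();
        boolean inContent = false;
        for (String line : record.split("\n", -1)) {
            if (inContent) {
                html.append(line).append('\n');
            } else if (line.trim().equals("Content:")) {
                inContent = true; // the HTML starts on the next line
            }
        }
        return html.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default: a file holding the saved readseg -get output.
        Path file = Path.of(args.length > 0 ? args[0] : "record.txt");
        System.out.println(extractContent(Files.readString(file)));
    }
}
```

One way to use it: redirect the output of step 7 into record.txt, then run java SegmentDumpContent.java record.txt (Java 11 or later).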

Answer 2

Score: 0

The raw content of a page (HTML, but possibly also a binary format such as PDF) is stored in the segment, in the subfolder "content". Note that the content is only stored

  • if the property fetcher.store.content is true (this is the default), and
  • if fetching was successful. An attempt to fetch the given URL resulted in an HTTP 403 Forbidden response, so the site is very likely protected.
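
To check the second condition independently of Nutch, a small Java program can request the URL and print the status code. This is only a sketch: the class FetchStatusCheck and the method contentWouldBeStored are hypothetical, and treating exactly HTTP 200 as "successful" is a simplification of the two conditions above, not Nutch's actual logic. A site may also answer a plain Java client differently than it answers Nutch.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchStatusCheck {

    /**
     * Simplified model of the two conditions: content storing is enabled
     * and the fetch returned HTTP 200. This is an assumption made for
     * illustration, not a copy of Nutch's behaviour.
     */
    static boolean contentWouldBeStored(boolean storeContent, int httpStatus) {
        return storeContent && httpStatus == 200;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        String url = args.length > 0 ? args[0] : "https://nutch.apache.org/";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        // Only the status code matters here, so the response body is discarded.
        HttpResponse<Void> response =
                client.send(request, HttpResponse.BodyHandlers.discarding());
        int status = response.statusCode();
        System.out.println("HTTP status: " + status);
        System.out.println("Content would be stored (with default settings): "
                + contentWouldBeStored(true, status));
    }
}
```

If this prints 403 for the site in question, the empty segment is explained by the failed fetch rather than by the Nutch configuration.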

huangapple
  • Published on March 9, 2023 at 17:01:53