Using Java & Apache Nutch to scrape dynamic elements from a website
Question
I want to do scraping in Java, and Apache Nutch seems to be the first choice. I have to scrape dynamic elements from a website, such as the price and mileage of vehicles. I have done the setup and tried to execute Nutch for the seed.txt URL - https://www.andersondouglas.com. But all I can see in crawl/segments is a file that just contains the URL name. I can't see or find the HTML content of the crawled webpage.
Can someone please help? How can I scrape the HTML content?
apache-nutch version 1.19
Answer 1
Score: 1
Here are the steps to fetch a URL and to export the HTML of the fetched page:

- Install Nutch and configure the agent name as described in the Nutch tutorial. Except for the agent name, all other configuration settings are the defaults. The next steps are run in an empty directory. The command `nutch` stands for `...nutch_install_path/bin/nutch`.
- Place the URL into the seed file:
  ```
  echo https://nutch.apache.org/ >seeds.txt
  ```
- Inject the seed into the CrawlDb:
  ```
  nutch inject crawldb seeds.txt
  ```
- Generate a segment:
  ```
  nutch generate crawldb/ segments/
  ```
- Fetch the generated segment (the segment name is a timestamp and needs to be adapted):
  ```
  nutch fetch segments/20230310113604/
  ```
- (Optionally) parse the segment (only required if metadata, outlinks or plain text are needed):
  ```
  nutch parse segments/20230310113604/
  ```
- Get the record of the URL (it includes the HTML but also more information):
  ```
  $> nutch readseg -get segments/20230310113604/ https://nutch.apache.org/
  ...
  Content:
  <!DOCTYPE html>
  <html lang="en-us">
  <head>
  <meta name="generator" content="Hugo 0.92.2" />
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title> Apache Nutch™ </title>
  ...
  ```
- (Alternatively) dump the segment:
  ```
  nutch readseg -dump segments/20230310113604/ segdump -recode
  ```
  The HTML text is written to `segdump/dump` and is recoded to UTF-8. Run `nutch readseg` to get help on more command-line options. A Java sketch for reading the segment records programmatically follows below.
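If you would rather read the fetched HTML from Java code instead of running `nutch readseg`, the records under a segment's `content` directory are Hadoop SequenceFiles keyed by URL (`Text`) with `org.apache.nutch.protocol.Content` values. The following is a minimal, untested sketch under that assumption; the segment path and part name (`part-00000`) are placeholders to adapt, and the Nutch and Hadoop jars must be on the classpath.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class SegmentContentReader {

    public static void main(String[] args) throws Exception {
        // "data" file inside the segment's content directory
        // (adapt the timestamp and part name to your own crawl).
        Path data = new Path("crawl/segments/20230310113604/content/part-00000/data");

        Configuration conf = new Configuration();
        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(data))) {
            Text url = new Text();
            Content content = new Content();
            // Iterate over all records fetched into this segment.
            while (reader.next(url, content)) {
                System.out.println("URL:  " + url);
                System.out.println("Type: " + content.getContentType());
                // Raw page bytes; for an HTML page this is the page source.
                // Assumes UTF-8, matching the -recode option above.
                System.out.println(new String(content.getContent(), StandardCharsets.UTF_8));
            }
        }
    }
}
```

This reads the same data that `readseg -dump` exports, just without the intermediate dump file.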
Answer 2
Score: 0
The raw content of a page (HTML, but it could also be a binary format such as PDF) is stored in the segment in the subfolder "content". Note that the content is only stored

- if the property `fetcher.store.content` is true (this is the default), and
- if fetching was successful. A trial to fetch the given URL resulted in an HTTP 403 Forbidden, so very likely the site is protected (see the sketch below for a quick way to verify this).
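To verify the 403 independently of Nutch, a quick status check with the JDK's `HttpURLConnection` is enough. This is a small sketch; the user-agent value is a placeholder, and using the same agent name you configured as `http.agent.name` in `nutch-site.xml` makes the test closer to what Nutch actually sends.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchStatusCheck {

    public static void main(String[] args) throws Exception {
        URL url = new URL("https://www.andersondouglas.com/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Placeholder agent name: replace with the value of http.agent.name
        // from your nutch-site.xml.
        conn.setRequestProperty("User-Agent", "MyNutchCrawler");
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);

        // A 403 here means the server refuses the request, so Nutch has no
        // content to store for this URL either.
        System.out.println("HTTP status: " + conn.getResponseCode());
        conn.disconnect();
    }
}
```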