Java HTML XPath selector
Question
I am trying to find a Java library, similar to C#'s HtmlAgilityPack, that parses HTML and selects elements using XPath.
I have read about many libraries, but none of them is a standalone XPath selector for HTML; every library I have found, such as HtmlUnit, requires parsing the HTML with its own methods.
If someone could guide me with a simple example of XPath 2.0 or 3.0 combined with HTML parsing, I would appreciate it.
Answer 1
Score: 1
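Java has built-in support for XPath. It is usually used for parsing XML files, but it also works for HTML, provided the HTML is well-formed XML (every tag closed, as in the sample below). The XML parser rejects typical real-world HTML that is not well-formed.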
HTML sample:
<html lang="en">
<head>
<title>Index page</title>
</head>
<body>
<div>
<br/>
<h1>Hello <span id="my-demo">User!</span></h1>
<br/>
<img src="https://s3.amazonaws.com/acloudguru-opsworkslab/ACG_Austin.JPG" alt="photo"/>
</div>
</body>
</html>
Code snippet:
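import java.io.File;
import java.io.IOException;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class HtmlXpathParser {
    private final DocumentBuilder builder;
    private final XPath path;

    public HtmlXpathParser() throws ParserConfigurationException {
        // JAXP XML parser: the input must be well-formed XML (XHTML-style HTML).
        DocumentBuilderFactory dbfactory = DocumentBuilderFactory.newInstance();
        builder = dbfactory.newDocumentBuilder();
        XPathFactory xpfactory = XPathFactory.newInstance();
        path = xpfactory.newXPath();
    }

    public Optional<String> parse(String fileName) throws SAXException, IOException, XPathExpressionException {
        File file = new File(fileName);
        Document doc = builder.parse(file);
        // evaluate() returns an empty string, not null, when nothing matches.
        String result = path.evaluate("//img/@src", doc);
        return result.isEmpty() ? Optional.empty() : Optional.of(result);
    }

    public static void main(String[] args) throws ParserConfigurationException, XPathExpressionException, SAXException, IOException {
        HtmlXpathParser parser = new HtmlXpathParser();
        Optional<String> srcResult = parser.parse("src/main/resources/index.html");
        srcResult.ifPresent(System.out::println);
    }
}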
Output:
https://s3.amazonaws.com/acloudguru-opsworkslab/ACG_Austin.JPG
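If you need every matching node and not just the first string value, the same JDK API can return a node set. Here is a minimal sketch that parses the markup from an in-memory string and also shows what happens with HTML that is not well-formed. The class name HtmlXpathSelect, its two helper methods, and the a.jpg / b.jpg values are made up for illustration; only the javax.xml and org.w3c.dom calls come from the JDK.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class for illustration; not part of the original answer.
final class HtmlXpathSelect {

    /** Returns the text content of every node matched by the XPath 1.0 expression. */
    static List<String> select(String html, String expression)
            throws ParserConfigurationException, SAXException, IOException, XPathExpressionException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Parse from memory instead of a file; throws SAXException if the markup is not well-formed.
        Document doc = builder.parse(new InputSource(new StringReader(html)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // NODESET asks for all matches, in document order, instead of a single string.
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    /** Reports whether the JDK XML parser accepts the markup at all. */
    static boolean isWellFormed(String html) throws ParserConfigurationException, IOException {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(html)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Sample markup with made-up image names.
        String html = "<html><body><div>"
                + "<h1>Hello <span id=\"my-demo\">User!</span></h1>"
                + "<img src=\"a.jpg\" alt=\"one\"/><img src=\"b.jpg\" alt=\"two\"/>"
                + "</div></body></html>";
        System.out.println(select(html, "//img/@src"));            // prints [a.jpg, b.jpg]
        System.out.println(select(html, "//span[@id='my-demo']")); // prints [User!]
        // Unclosed <br> and <p>, as browsers tolerate, are rejected by the XML parser.
        System.out.println(isWellFormed("<html><body><br><p>unclosed</body></html>")); // prints false
    }
}
```

The last call is the practical limit of this approach: for pages you do not control, you would first need something that turns tag soup into a well-formed DOM before XPath can be applied.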
The JDK's built-in javax.xml.xpath implementation supports XPath 1.0 only. If you need XPath 2.0, you could use something like xpath2-parser.
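In practice, staying on XPath 1.0 mostly means doing without the newer string functions. XPath 2.0 has lower-case() and ends-with(), for example, while XPath 1.0 does not, but both can be approximated with translate(), substring() and string-length(). A rough sketch follows, assuming ASCII-only input; the class XPath1Workarounds, its helper methods, and the sample file names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration; builds XPath 1.0 expression strings.
final class XPath1Workarounds {
    private static final String UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final String LOWER = "abcdefghijklmnopqrstuvwxyz";

    /** XPath 1.0 stand-in for XPath 2.0 lower-case(operand); handles ASCII letters only. */
    static String lowerCase(String operand) {
        return "translate(" + operand + ", '" + UPPER + "', '" + LOWER + "')";
    }

    /** XPath 1.0 stand-in for XPath 2.0 ends-with(operand, suffix); suffix must not contain a single quote. */
    static String endsWith(String operand, String suffix) {
        return "substring(" + operand + ", string-length(" + operand + ") - string-length('"
                + suffix + "') + 1) = '" + suffix + "'";
    }

    /** Evaluates the expression against well-formed markup and returns the first match as a string. */
    static String evaluate(String xml, String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws XPathExpressionException {
        // Made-up sample markup.
        String xml = "<div><img src=\"Photo.JPG\" alt=\"photo\"/>"
                + "<img src=\"logo.png\" alt=\"logo\"/></div>";
        // Case-insensitive "src ends with .jpg", written with XPath 1.0 functions only.
        String expression = "//img[" + endsWith(lowerCase("@src"), ".jpg") + "]/@alt";
        System.out.println(evaluate(xml, expression)); // prints photo
    }
}
```

This gets verbose quickly, which is the usual reason to reach for an XPath 2.0 or 3.0 engine once expressions need regular expressions, sequences or proper string handling.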