Java HTML XPath selector

Question

I am trying to find a Java library, similar to C#'s Html Agility Pack, that parses HTML and selects elements using XPath.

I have read about many libraries, but none of them is a standalone XPath selector for HTML. All the libraries I have found, such as HtmlUnit, require the HTML to be parsed with their own methods.

If someone could guide me with a simple example of XPath 2.0 or 3.0 and HTML parsing, I would appreciate it.

Answer 1

Score: 1

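Java has built-in XPath support in the javax.xml.xpath package. It is usually used for querying XML files, but it also works for HTML, provided the markup is well-formed XML; the standard DocumentBuilder rejects documents with unclosed tags such as a bare <br>.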

HTML sample:

<html lang="en">
<head>
    <title>Index page</title>
</head>
<body>
<div>
    <br/>
    <h1>Hello <span id="my-demo">User!</span></h1>
    <br/>
    <img src="https://s3.amazonaws.com/acloudguru-opsworkslab/ACG_Austin.JPG" alt="photo"/>
</div>
</body>
</html>

Code snippet:

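import java.io.File;
import java.io.IOException;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Parses a well-formed HTML file with the JDK's XML parser and queries it with XPath 1.0.
public class HtmlXpathParser {
    private final DocumentBuilder builder;
    private final XPath path;

    public HtmlXpathParser() throws ParserConfigurationException {
        DocumentBuilderFactory dbfactory = DocumentBuilderFactory.newInstance();
        builder = dbfactory.newDocumentBuilder();
        XPathFactory xpfactory = XPathFactory.newInstance();
        path = xpfactory.newXPath();
    }

    // Returns the src attribute of the first <img> element, or an empty Optional if there is none.
    public Optional<String> parse(String fileName) throws SAXException, IOException, XPathExpressionException {
        File file = new File(fileName);

        Document doc = builder.parse(file);
        // evaluate() returns an empty string (never null) when nothing matches.
        String result = path.evaluate("//img/@src", doc);

        return Optional.of(result).filter(s -> !s.isEmpty());
    }

    public static void main(String[] args) throws ParserConfigurationException, XPathExpressionException, SAXException, IOException {
        HtmlXpathParser parser = new HtmlXpathParser();

        Optional<String> srcResult = parser.parse("src/main/resources/index.html");
        srcResult.ifPresent(System.out::println);
    }
}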

Output:

https://s3.amazonaws.com/acloudguru-opsworkslab/ACG_Austin.JPG
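
The snippet above returns only the first match as a string. As a rough sketch of how the same JDK API could return every match, the example below parses markup from a string and asks for a node set instead. The class name XPathNodeSetDemo, the selectAll helper, and the a.jpg/b.jpg sample values are made up for illustration, and the markup is still assumed to be well-formed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; not part of the original answer.
class XPathNodeSetDemo {

    // Returns the text content of every node matched by an XPath 1.0 expression.
    // Assumption: the markup is well-formed XML; otherwise parsing fails.
    static List<String> selectAll(String html, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(html)));
            // NODESET asks for all matches instead of the string value of the first one.
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><div>"
                + "<h1>Hello <span id=\"my-demo\">User!</span></h1>"
                + "<img src=\"a.jpg\" alt=\"first\"/>"
                + "<img src=\"b.jpg\" alt=\"second\"/>"
                + "</div></body></html>";
        System.out.println(selectAll(html, "//img/@src"));            // [a.jpg, b.jpg]
        System.out.println(selectAll(html, "//span[@id='my-demo']")); // [User!]
    }
}
```

Markup that is not well-formed, for example an unclosed <br>, makes the parser throw a SAXException, which this sketch rethrows as an IllegalArgumentException. Real-world HTML of that kind would need to be cleaned up by an HTML-aware parser first.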

The built-in API supports XPath 1.0 only. If you need XPath 2.0, you could use something like xpath2-parser.
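
Sticking to XPath 1.0 mostly means doing without the XPath 2.0 function library, for example lower-case(). Many such cases have a 1.0 workaround. The sketch below, a hypothetical illustration rather than part of the original answer, uses the XPath 1.0 translate() function for an ASCII case-insensitive attribute match. The class name, the findSrcByAlt helper, and the sample markup are all assumptions.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original answer.
class XPath1CaseInsensitiveDemo {
    private static final String UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final String LOWER = "abcdefghijklmnopqrstuvwxyz";

    // Returns the src of the first <img> whose alt equals the given text,
    // ignoring ASCII case, or an empty string when nothing matches.
    // Assumptions: well-formed markup, and alt contains no single quote.
    static String findSrcByAlt(String html, String alt) {
        // translate() maps each upper-case letter to its lower-case counterpart,
        // standing in for the XPath 2.0 lower-case() function.
        String expression = String.format(
                "//img[translate(@alt, '%s', '%s') = '%s']/@src",
                UPPER, LOWER, alt.toLowerCase(Locale.ROOT));
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // This overload parses the source and evaluates the expression in one step.
            return xpath.evaluate(expression, new InputSource(new StringReader(html)));
        } catch (XPathExpressionException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><img src=\"a.jpg\" alt=\"PHOTO\"/></body></html>";
        System.out.println(findSrcByAlt(html, "photo")); // a.jpg
    }
}
```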

huangapple
  • Posted on April 9, 2020 at 18:35:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/61119213.html