April 9, 2020 18:35:53

Java HTML XPath selector

Question

I am looking for a Java library similar to C#'s HtmlAgilityPack that can parse HTML and select elements using XPath.

I have read about many libraries, but none of them is a standalone XPath selector for HTML. Every library I have found (HtmlUnit, for example) requires parsing the HTML with its own methods.

If someone could guide me with a simple example of XPath 2.0 or 3.0 combined with HTML parsing, I would appreciate it.

Answer 1

Score: 1

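Java has built-in support for XPath. It is usually used for querying XML files, but it also works for HTML as long as the markup is well-formed XML (every tag closed, attributes quoted), because the standard DocumentBuilder is a strict XML parser.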

HTML sample:

<html lang="en">
<head>
    <title>Index page</title>
</head>
<body>
<div>
    <br/>
    <h1>Hello <span id="my-demo">User!</span></h1>
    <br/>
    <img src="https://s3.amazonaws.com/acloudguru-opsworkslab/ACG_Austin.JPG" alt="photo"/>
</div>
</body>
</html>

Code snippet:

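import java.io.File;
import java.io.IOException;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class HtmlXpathParser {
    private final DocumentBuilder builder;
    private final XPath path;

    public HtmlXpathParser() throws ParserConfigurationException {
        DocumentBuilderFactory dbfactory = DocumentBuilderFactory.newInstance();
        builder = dbfactory.newDocumentBuilder();
        XPathFactory xpfactory = XPathFactory.newInstance();
        path = xpfactory.newXPath();
    }

    // Returns the src of the first <img>, or an empty Optional if nothing matches.
    public Optional<String> parse(String fileName) throws SAXException, IOException, XPathExpressionException {
        File file = new File(fileName);

        // DocumentBuilder is an XML parser, so the file must be well-formed.
        Document doc = builder.parse(file);
        // evaluate() returns an empty string (never null) when nothing matches.
        String result = path.evaluate("//img/@src", doc);

        return Optional.of(result).filter(s -> !s.isEmpty());
    }

    public static void main(String[] args) throws ParserConfigurationException, XPathExpressionException, SAXException, IOException {
        HtmlXpathParser parser = new HtmlXpathParser();

        Optional<String> srcResult = parser.parse("src/main/resources/index.html");
        srcResult.ifPresent(System.out::println);
    }
}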

Output:

>https://s3.amazonaws.com/acloudguru-opsworkslab/ACG_Austin.JPG
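
Since the question asks about selecting elements rather than a single string, here is a minimal sketch of the same JDK-only approach that returns every matching node. The class name XPathStringDemo, the select helper, and the a.png / b.png sample values are made up for illustration, and the markup is assumed to be well-formed; this is not a general HTML parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration; not part of the original answer.
class XPathStringDemo {

    // Evaluates an XPath 1.0 expression against well-formed markup held in a
    // String and returns the text content of every matching node.
    static List<String> select(String markup, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(markup)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // NODESET asks for all matches instead of the string value of the first one.
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Sample markup is an assumption, modelled on the HTML sample above.
        String html = "<html><body><div>"
                + "<h1>Hello <span id=\"my-demo\">User!</span></h1>"
                + "<img src=\"a.png\" alt=\"first\"/>"
                + "<img src=\"b.png\" alt=\"second\"/>"
                + "</div></body></html>";

        System.out.println(select(html, "//img/@src"));
        System.out.println(select(html, "//span[@id='my-demo']"));
    }
}
```

Parsing from a String through InputSource avoids the file on disk, which is convenient when the markup comes from an HTTP response. Markup that is not well-formed (an unclosed br tag, for instance) will make builder.parse throw a SAXException.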

The snippet above uses XPath 1.0, which is the version supported by the JDK's built-in javax.xml.xpath API. If you need XPath 2.0, you could use something like xpath2-parser.
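
Before reaching for a 2.0 library, it may help to see how far XPath 1.0 functions go. The sketch below is only an illustration under assumptions: the class XPath1Functions and its method names are hypothetical, and the markup is a shortened, well-formed variant of the sample above. It uses count(), contains(), and translate(), the last one standing in for the lower-case() function that only exists from XPath 2.0 onward.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class for illustration; not part of the original answer.
class XPath1Functions {

    private static final String UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final String LOWER = "abcdefghijklmnopqrstuvwxyz";

    // Parses well-formed markup from a String into a DOM document.
    static Document parse(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
    }

    // count() yields an XPath number, which the JDK maps to a Double.
    static int countImages(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Double n = (Double) xpath.evaluate("count(//img)", doc, XPathConstants.NUMBER);
        return n.intValue();
    }

    // XPath 1.0 has no lower-case(); translate() is the usual ASCII-only substitute.
    static String lowerCaseText(Document doc, String elementPath) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expr = "translate(string(" + elementPath + "), '" + UPPER + "', '" + LOWER + "')";
        return xpath.evaluate(expr, doc);
    }

    // True if at least one <img> has an alt attribute containing the given word.
    // The word is spliced into the expression, so it must not contain a single quote.
    static boolean hasImageWithAlt(Document doc, String word) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expr = "boolean(//img[contains(@alt, '" + word + "')])";
        return (Boolean) xpath.evaluate(expr, doc, XPathConstants.BOOLEAN);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body>"
                + "<h1>Hello <span id=\"my-demo\">User!</span></h1>"
                + "<img src=\"a.png\" alt=\"photo\"/>"
                + "</body></html>");

        System.out.println(countImages(doc));
        System.out.println(lowerCaseText(doc, "//span[@id='my-demo']"));
        System.out.println(hasImageWithAlt(doc, "photo"));
    }
}
```

If expressions like these cover your needs, the built-in API is enough. Sequences, regular expressions, and the other XPath 2.0/3.0 features do require a separate library.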
