July 25, 2020, 23:23:24

Extract links from a web page in core Java using indexOf, substring vs pattern matching

Question

I am trying to get the links in a web page using core Java. I am following the code below, taken from <https://stackoverflow.com/questions/5120171/extract-links-from-a-web-page>, with some modifications.

        try {
            url = new URL("http://www.stackoverflow.com");
            is = url.openStream();  // throws an IOException
            br = new BufferedReader(new InputStreamReader(is));

            while ((line = br.readLine()) != null) {
                if(line.contains("href="))
                    System.out.println(line.trim());
            }
        }

As for extracting each link, most of the answers in that post suggest using pattern matching. However, as I understand it, pattern matching is an expensive operation, so I want to use indexOf and substring operations to get the link text from each line, as below:

    private static Set<String> getUrls(String line, int firstIndexOfHref) {
        int startIndex = firstIndexOfHref;
        int endIndex;
        Set<String> urls = new HashSet<>();

        while(startIndex != -1) {
            try {
                endIndex = line.indexOf("\"", startIndex + 6);
                String url = line.substring(startIndex + 6, endIndex);
                urls.add(url);
                startIndex =  line.indexOf("href=\"http", endIndex);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        return urls;
    }

I have tried this on a few pages and it works properly.
However, I am not sure this approach always works. I want to know whether this logic can fail in some real-world scenarios.

Please help.

Answer 1

Score: 1

Your code relies on well-formed HTML that fits on one line. It will not handle the various other ways to write <a href, such as single quotes, no quotes, extra whitespace (including newlines between "a", "href" and "="), relative paths, or other protocols such as file: or ftp:.

Some examples you would need to consider:

    <a href 
       =/questions/63090090/extract-links-from-a-web-page-in-core-java-using-indexof-substring-vs-pattern-m

    <a href = 'http://host'

    <a 
    href = 'http://host'
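
To make these cases concrete, here is a minimal sketch that checks whether the literal token searched for in the question, `href="http`, occurs in a few sample tags. The class name `HrefIndexOfCheck`, the helper `foundByIndexOf`, and the sample strings are hypothetical and exist only for illustration.

```java
class HrefIndexOfCheck {
    // The exact token the loop in the question searches for.
    static final String NEEDLE = "href=\"http";

    /** Returns true if a plain indexOf search for NEEDLE finds a link in the text. */
    static boolean foundByIndexOf(String html) {
        return html.indexOf(NEEDLE) != -1;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<a href=\"http://host\">",      // double quotes, no spaces: found
            "<a href = 'http://host'>",      // single quotes and spaces: missed
            "<a href=/questions/63090090>",  // unquoted relative path: missed
            "<a href=\"ftp://host/file\">"   // other protocol: missed
        };
        for (String s : samples) {
            System.out.println(foundByIndexOf(s) + "  " + s);
        }
    }
}
```

Only the first sample is found. Note also that in the question's loop, a line whose closing double quote is missing makes `indexOf` return -1, `substring` then throws, the exception is caught, and `startIndex` never changes, so the `while` loop does not terminate.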

That's why the other question has many answers, including HTML validators and regex patterns.
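
As a rough illustration of the regex route, the sketch below uses `java.util.regex` to accept double-quoted, single-quoted, and unquoted `href` values, with optional whitespace or newlines around `=`. The class name `HrefRegexSketch`, the method `extractHrefs`, and the pattern itself are assumptions made for this example, not code from either question. It expects the whole page as a single string rather than one line at a time, and it is not a full HTML parser: for instance, it would also pick up an attribute such as `data-href`.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefRegexSketch {
    // Compiled once and reused. Group 1: double-quoted value,
    // group 2: single-quoted value, group 3: unquoted value.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    /** Collects href values of <a> tags in document order, without duplicates. */
    static Set<String> extractHrefs(String html) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Exactly one of the three groups is non-null for each match.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    urls.add(m.group(g));
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href \n   =/questions/63090090/x>one</a>"
                    + "<a href = 'http://host'>two</a>"
                    + "<a \nhref = \"ftp://host/file\">three</a>";
        System.out.println(extractHrefs(html));
    }
}
```

Compiling the pattern once into a static field and reusing it avoids paying the compilation cost on every call. Whether matching is then too slow for a given page is something to measure rather than assume.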
