Extract links from a web page in core Java using indexOf, substring vs pattern matching
Question
I am trying to get the links in a web page using core Java. I am following the code below, given in <https://stackoverflow.com/questions/5120171/extract-links-from-a-web-page>, with some modifications.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

URL url = new URL("http://www.stackoverflow.com");
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(url.openStream()))) { // openStream() throws IOException
    String line;
    while ((line = br.readLine()) != null) {
        if (line.contains("href=")) {
            System.out.println(line.trim());
        }
    }
}
As for extracting each link, most of the answers in the above post suggest pattern matching. However, as I understand it, pattern matching is an expensive operation, so I want to use indexOf and substring operations to get the link text from each line, as below:
private static Set<String> getUrls(String line, int firstIndexOfHref) {
    int startIndex = firstIndexOfHref;
    Set<String> urls = new HashSet<>();
    while (startIndex != -1) {
        // startIndex points at href=" — skip those 6 characters to reach the URL.
        int endIndex = line.indexOf("\"", startIndex + 6);
        if (endIndex == -1) {
            break; // no closing quote on this line; stop rather than loop forever
        }
        urls.add(line.substring(startIndex + 6, endIndex));
        startIndex = line.indexOf("href=\"http", endIndex);
    }
    return urls;
}
I have tried this on a few pages and it is working properly. However, I am not sure whether this approach always works, and I want to know if this logic can fail in real-world scenarios. Please help.
Answer 1
Score: 1
Your code relies on well-formatted HTML with each link on a single line. It will not handle the various other ways an <a href ...> tag can be written, such as single quotes, no quotes, extra whitespace (including newlines between "a", "href", and "="), relative paths, or other protocols such as file: or ftp:.
Some examples you would need to consider:
<a href
=/questions/63090090/extract-links-from-a-web-page-in-core-java-using-indexof-substring-vs-pattern-m
or
<a href = 'http://host'
or
<a
href = 'http://host'
That's why the other question has so many answers, including HTML validators and regex patterns.
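For illustration, here is a minimal sketch of the regex approach those answers describe. The pattern, class name, and method names below are my own assumptions, not from the original post; the pattern tolerates the quoting and whitespace variations shown above, although a real HTML parser is still more robust.

import java.io.IOException;
import java.net.URL;
import java.util.LinkedHashSet;
import java.util.Scanner;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Accepts double quotes, single quotes, or no quotes, and any amount of
    // whitespace (including newlines) between "a", "href", and "=".
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*(\"[^\"]*\"|'[^']*'|[^\\s>]+)",
            Pattern.CASE_INSENSITIVE);

    static Set<String> extractLinks(String html) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Strip the surrounding quotes, if any.
            urls.add(m.group(1).replaceAll("^[\"']|[\"']$", ""));
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.stackoverflow.com");
        // Read the whole page into one string so tags split across lines still match.
        try (Scanner sc = new Scanner(url.openStream(), "UTF-8")) {
            String html = sc.useDelimiter("\\A").next();
            extractLinks(html).forEach(System.out::println);
        }
    }
}

Note that even this pattern is only an approximation: it will also pick up href attributes inside HTML comments or scripts, for example.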
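For completeness, a minimal sketch using jsoup, a widely used third-party HTML parser (so not strictly "core Java"): parsing the page into a DOM sidesteps the formatting pitfalls above entirely.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupLinks {
    public static void main(String[] args) throws Exception {
        // jsoup parses the page into a DOM, so quoting and whitespace
        // variations in the markup no longer matter.
        Document doc = Jsoup.connect("http://www.stackoverflow.com").get();
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href")); // resolves relative URLs
        }
    }
}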