How to strip down a string to the first duplicate?
Question
I have converted a few web pages into a string, and the string contains these lines (along with other code):
<div class="r"><a href="https://www.apple.com/ca/"
<div class="r"><a href="https://www.facebook.com/ca/"
<div class="r"><a href="https://www.utorrent.com/ca/"
I just want to extract the link from the first line (https://www.apple.com/ca/) and ignore the rest of the HTML and code. How do I do that?
Answer 1
Score: 2
The easy way:
String url = input.replaceAll("(?s).*?href=\"(.*?)\".*", "$1");
Key points of why this works:
- The regex matches the whole input but captures only the target. The replacement is the capture (group 1), so the call effectively extracts the target.
- (?s) means "dot matches newline".
- .*? reluctantly (consuming as little input as possible) matches everything up to href=".
- (.*?) reluctantly captures everything up to the next ".
- .* greedily (consuming as much as possible) matches the rest of the input, including later lines thanks to (?s) above.
- The replacement is $1, the first (and only) group in the match.
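The points above can be checked with a small, self-contained sketch. The class name FirstHrefDemo, the helper names viaReplaceAll and viaFind, and the sample input are assumptions made for illustration; they are not part of the original answer. The sketch also shows one caveat: if the input contains no href="...", the regex does not match and replaceAll returns the input unchanged, so a Matcher.find() variant is included to make the no-match case explicit.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstHrefDemo {
    // Same pattern as in the answer: it matches the whole input, group 1 is the first href value.
    private static final String WHOLE_INPUT = "(?s).*?href=\"(.*?)\".*";

    // Illustrative alternative: match only the attribute itself and search for it.
    private static final Pattern HREF = Pattern.compile("href=\"(.*?)\"");

    // Replaces the entire input with group 1; returns the input unchanged if nothing matches.
    static String viaReplaceAll(String input) {
        return input.replaceAll(WHOLE_INPUT, "$1");
    }

    // Returns the first href value, or an empty Optional when there is none.
    static Optional<String> viaFind(String input) {
        Matcher m = HREF.matcher(input);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical sample input spanning several lines.
        String input = "<html>\n"
                + "<div class=\"r\"><a href=\"https://www.apple.com/ca/\">A</a>\n"
                + "<div class=\"r\"><a href=\"https://www.facebook.com/ca/\">B</a>\n";

        System.out.println(viaReplaceAll(input));                    // prints https://www.apple.com/ca/
        System.out.println(viaFind(input).orElse("(no link found)")); // prints https://www.apple.com/ca/
        System.out.println(viaReplaceAll("no links here"));          // prints no links here
    }
}
```

The replaceAll form is the shortest, while the find() form avoids scanning the rest of the input and lets the caller distinguish "no link" from a real result.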
Answer 2
Score: 1
Using the URL-matching regex mentioned in another answer, here is a solution using the Java regex API:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // Sample input: three anchors, one per line
        String str = "<div class=\"r\"><a href=\"https://www.apple.com/ca/\">Hello</a>\n"
                + "<div class=\"r\"><a href=\"https://www.facebook.com/ca/\">Hello</a>\n"
                + "<div class=\"r\"><a href=\"https://www.utorrent.com/ca/\">Hello</a>";
        // General URL pattern: scheme, "://", then URL characters ending in a non-punctuation character
        String regex = "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(str);
        // Print every URL found, in order of appearance
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}
Output:
https://www.apple.com/ca/
https://www.facebook.com/ca/
https://www.utorrent.com/ca/
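Since the question asks only for the first link, the search can stop after the first match. The following is a minimal sketch, assuming a hypothetical class FirstUrlDemo and helper firstUrl (neither is from the original answer). It reuses the regex above unchanged and replaces the while loop with a single find() call.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstUrlDemo {
    // Same URL regex as in the answer, compiled once and reused.
    private static final Pattern URL = Pattern.compile(
            "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    // Returns the first URL in the text, or an empty Optional if none is found.
    static Optional<String> firstUrl(String text) {
        Matcher matcher = URL.matcher(text);
        // A single find() stops at the first match instead of iterating over all of them.
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String str = "<div class=\"r\"><a href=\"https://www.apple.com/ca/\">Hello</a>\n"
                + "<div class=\"r\"><a href=\"https://www.facebook.com/ca/\">Hello</a>\n"
                + "<div class=\"r\"><a href=\"https://www.utorrent.com/ca/\">Hello</a>";
        System.out.println(firstUrl(str).orElse("(no URL found)")); // prints https://www.apple.com/ca/
    }
}
```

Note that this matches any URL in the text, not only href values, so it may pick up a link from a script or stylesheet reference if one appears earlier in the page.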