Extraction of string from particular tag using java
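Question
Why am I not getting the expected output?
You are not getting the expected output because your regular expression matches too greedily. In a regular expression, .* is commonly used to match any characters, but by default it is greedy and matches as many characters as possible. As a result, your regex matches all the text between the first <AT> and the last </AT>.
To fix this, make the match non-greedy (reluctant) by using .*? instead of .*. The regex then matches as few characters as possible, so each match ends at the nearest closing </AT> and covers exactly one <AT>...</AT> pair.
Here is the corrected regex and code: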
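// Needs java.util.regex.Matcher and java.util.regex.Pattern
private static final Pattern TAG_REGEX = Pattern.compile("<AT>(.*?)</AT>");

public static void getText(String text) {
    final Matcher matcher = TAG_REGEX.matcher(text);
    // Each find() call advances to the next <AT>...</AT> pair
    while (matcher.find()) {
        String url = matcher.group(1); // text between the tags
        System.out.println("Extracted URL::" + url);
    }
}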
With this corrected regex, you should get the expected output:
Extracted URL::EXTRACT_URL
Extracted URL::EXTRACT_URL
Extracted URL::EXTRACT_URL
Extracted URL::EXTRACT_URL
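For a version that can be compiled and run on its own, here is a minimal sketch. The class name AtTagExtractor and the method name extract are hypothetical, and the input is a shortened stand-in for the test string in the question. It collects the matches into a list instead of printing inside the loop, which is just one possible design.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class AtTagExtractor {
    // Reluctant quantifier: group 1 stops at the nearest closing </AT>.
    private static final Pattern TAG_REGEX = Pattern.compile("<AT>(.*?)</AT>");

    // Returns the text of every <AT>...</AT> pair, in order of appearance.
    public static List<String> extract(String text) {
        List<String> values = new ArrayList<>();
        Matcher matcher = TAG_REGEX.matcher(text);
        while (matcher.find()) {
            values.add(matcher.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the test string in the question (an assumption).
        String html = "<a href=\"<AT>EXTRACT_URL</AT>\" target=\"_blank\">Sign In</a> "
                + "<a href=\"<AT>EXTRACT_URL</AT>\" target=\"_blank\">";
        for (String url : extract(html)) {
            System.out.println("Extracted URL::" + url);
        }
    }
}
```

Returning a list keeps the extraction separate from the printing, so the caller decides what to do with the values. Note that . does not match line terminators by default, so this sketch assumes the tag content stays on one line.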
Original question (English):
I have a few custom tags inside my HTML. As you can see in the HTML below, it contains <AT></AT> tags, and I need to extract the text from inside them.
I followed this approach:
- Wrote a regex that extracts the text from the AT tag
Below is the test string:
href="<AT>EXTRACT_URL</AT>" target="_blank" style="font-weight: bold;letter-spacing: normal;line-height: 100%;text-align: center;text-decoration: none;color: #FFFFFF;">Sign In</a></td></tr></tbody></table></td></tr> <a href="<AT>EXTRACT_URL</AT>" target="_blank" title="" class="" target="_blank"> <a href="<AT>EXTRACT_URL</AT>" target="_blank" title="" class="" target="_blank"> <a href="<AT>EXTRACT_URL</AT>" target="_blank" title="" class="" target="_blank">
I used the program below to extract the text from the AT tag:
private static final Pattern TAG_REGEX = Pattern.compile("<AT>(.*)</AT>");
public static void getText(String text) {
    final Matcher matcher = TAG_REGEX.matcher(text);
    while (matcher.find()) {
        String url = matcher.group(1);
        System.out.println("Extracted URL::" + url);
    }
}
Output from the above program:
Extracted URL::EXTRACT_URL</AT>" target="_blank" style="font-weight: bold;letter-spacing: normal;line-height: 100%;text-align: center;text-decoration: none;color: #FFFFFF;">Sign In</a></td></tr></tbody></table></td></tr> <a href="<AT>EXTRACT_URL</AT>" target="_blank" title="" class="" target="_blank"> <a href="<AT>EXTRACT_URL</AT>" target="_blank" title="" class="" target="_blank"> <a href="<AT>EXTRACT_URL
Expected Output:
Extracted URL::EXTRACT_URL
Extracted URL::EXTRACT_URL
Extracted URL::EXTRACT_URL
Extracted URL::EXTRACT_URL
Why am I not getting the expected output?
Answer 1
Score: 2
It's because of the pattern.
The correct pattern in this case would be
private static final Pattern TAG_REGEX = Pattern.compile("<AT>(.*?)</AT>");
Both will match any sequence of characters, but
- .* is greedy and will match as much as possible (it will end at the last </AT>)
- .*? is reluctant and will match as little as possible
More at this tutorial
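To see the difference side by side, here is a small sketch that runs both patterns against the same short input. The class name GreedyVsReluctantDemo, the helper firstMatch, and the sample string are all made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two quantifiers.
class GreedyVsReluctantDemo {
    // Returns group 1 of the first match, or null when nothing matches.
    public static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "<AT>one</AT> and <AT>two</AT>";
        // Greedy: runs to the last </AT>, prints one</AT> and <AT>two
        System.out.println(firstMatch("<AT>(.*)</AT>", input));
        // Reluctant: stops at the first </AT>, prints one
        System.out.println(firstMatch("<AT>(.*?)</AT>", input));
    }
}
```

The greedy pattern first consumes the rest of the input and then backtracks only as far as it must to find a closing </AT>, which is why it lands on the last one.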