Extracting a string from a particular tag using Java

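Question

Why am I not getting the expected output?

You are not getting the expected output because your regular expression is too greedy. In a regular expression, .* is commonly used to match any characters, but it is greedy by default and matches as many characters as possible. As a result, your regular expression matches everything between the first <AT> and the last </AT>.

To fix this, make the match non-greedy by using .*? instead of .*. The regular expression then matches as few characters as possible, so each match ends at the nearest closing </AT> tag.

Here is the corrected regular expression and code: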

private static final Pattern TAG_REGEX = Pattern.compile("<AT>(.*?)</AT>");

public static void getText(String text) {
    final Matcher matcher = TAG_REGEX.matcher(text);

    while (matcher.find()) {
        String url = matcher.group(1);
        System.out.println("Extracted URL::" + url);
    }
}

With this corrected regular expression, you should get the expected output:

Extracted URL::EXTRACT_URL
Extracted URL::EXTRACT_URL
Extracted URL::EXTRACT_URL
Extracted URL::EXTRACT_URL
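
As a self-contained sketch of the fix above, the class below wraps the same reluctant pattern and returns the matches as a list instead of printing them directly. The class name AtTagExtractor, the method name extractAll, and the shortened sample input are assumptions made for illustration; they do not come from the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption made for this sketch.
public class AtTagExtractor {

    // Reluctant quantifier: each match stops at the nearest </AT>.
    private static final Pattern TAG_REGEX = Pattern.compile("<AT>(.*?)</AT>");

    // Returns the text of every <AT>...</AT> pair, in order of appearance.
    public static List<String> extractAll(String text) {
        List<String> results = new ArrayList<>();
        Matcher matcher = TAG_REGEX.matcher(text);
        while (matcher.find()) {
            results.add(matcher.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the test string in the question.
        String html = "<a href=\"<AT>EXTRACT_URL</AT>\" target=\"_blank\"> "
                + "<a href=\"<AT>EXTRACT_URL</AT>\" target=\"_blank\">";
        for (String url : extractAll(html)) {
            System.out.println("Extracted URL::" + url);
        }
    }
}
```

Returning a list keeps the extraction separate from the printing, which may make the method easier to reuse and test.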

I have a few custom tags inside some HTML. As you can see in the HTML below, it contains <AT></AT> tags, and I need to extract the text between them.

My approach so far:

  1. I wrote a regex that should extract the text from the AT tag.

This is the test string:

href="<AT>EXTRACT_URL</AT>" target="_blank" style="font-weight: bold;letter-spacing: normal;line-height: 100%;text-align: center;text-decoration: none;color: #FFFFFF;">Sign In</a></td></tr></tbody></table></td></tr> <a href="<AT>EXTRACT_URL</AT>" target="_blank" title="" class="" target="_blank"> <a href="<AT>EXTRACT_URL</AT>" target="_blank" title="" class="" target="_blank"> <a href="<AT>EXTRACT_URL</AT>" target="_blank" title="" class="" target="_blank">

I used the program below to extract the text from the AT tags:

private static final Pattern TAG_REGEX = Pattern.compile("<AT>(.*)</AT>");

public static void getText(String text) {
    final Matcher matcher = TAG_REGEX.matcher(text);

    while (matcher.find()) {
        String url = matcher.group(1);

        System.out.println("Extracted URL::" + url);
    }
}

This is the output I get from the program above:

Extracted URL::EXTRACT_URL</AT>" target="_blank" style="font-weight: bold;letter-spacing: normal;line-height: 100%;text-align: center;text-decoration: none;color: #FFFFFF;">Sign In</a></td></tr></tbody></table></td></tr> <a href="<AT>EXTRACT_URL</AT>" target="_blank" title="" class="" target="_blank"> <a href="<AT>EXTRACT_URL</AT>" target="_blank" title="" class="" target="_blank"> <a href="<AT>EXTRACT_URL

Expected Output:

Extracted URL::EXTRACT_URL
Extracted URL::EXTRACT_URL
Extracted URL::EXTRACT_URL
Extracted URL::EXTRACT_URL

Why am I not getting the expected output?

Answer 1

Score: 2

It's because of the pattern.

The correct pattern in this case would be

private static final Pattern TAG_REGEX = Pattern.compile("<AT>(.*?)</AT>");

Both match any sequence of characters (excluding line terminators by default), but

  • .* is greedy and will match as much as possible (it will end at the last </AT>)
  • .*? is reluctant and will match as little as possible

More at this tutorial
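
To make the difference between the two quantifiers concrete, here is a minimal sketch that runs both patterns against the same short input and prints the first capture group of each. The class name GreedyVsReluctant, the helper firstGroup, and the sample input are hypothetical and chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; all names here are assumptions made for illustration.
public class GreedyVsReluctant {

    // Returns group(1) of the first match, or null when nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "<AT>one</AT> and <AT>two</AT>";

        // Greedy: .* grabs as much as it can, so the match runs to the last </AT>.
        System.out.println(firstGroup("<AT>(.*)</AT>", input));
        // prints: one</AT> and <AT>two

        // Reluctant: .*? grabs as little as it can, so the match stops at the first </AT>.
        System.out.println(firstGroup("<AT>(.*?)</AT>", input));
        // prints: one
    }
}
```

The greedy result has the same shape as the unexpected output in the question: a single match that swallows every tag between the first opening tag and the last closing tag.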

huangapple
  • Posted on August 5, 2020, 21:22:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63266153.html