Replace ASCII codes and HTML tags in Java

Question

How can I achieve the expected result below without using `StringEscapeUtils`?

    public class Main {
        public static void main(String[] args) throws Exception {
          String str = "<p><b>Send FWB <br><br> &#40;if AWB has COU SHC, <br> if ticked , will send FWB&#41;</b></p>";
          str = str.replaceAll("\\<.*?\\>", "");
          System.out.println("After removing HTML Tags: " + str);
        }
    }

**Current result**

    After removing HTML Tags: Send FWB  &#40;if AWB has COU SHC,  if ticked , will send FWB&#41;

**Expected result**

    After removing HTML Tags: Send FWB  if AWB has COU SHC,  if ticked , will send FWB;

Already checked:
https://stackoverflow.com/questions/994331/how-to-unescape-html-character-entities-in-java

<hr>

PS: This is just a sample; the input may vary.

Answer 1

Score: 1

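Your regular expression matches HTML tags of the form `<something>`, but it does not match HTML entities. Entities follow a pattern like `&.*?;`, which you are not replacing.

This should solve your problem: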

    str = str.replaceAll("\\<.*?\\>|&.*?;", "");
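For reference, here is a minimal runnable sketch that applies the combined pattern to the sample input from the question. The class name `StripTagsAndEntities` and the helper `strip` are made up for illustration; only `String.replaceAll` comes from the original code.

```java
// Sketch only: wraps the one-line fix in a small helper so it can be run and tested.
class StripTagsAndEntities {

    // Hypothetical helper: removes anything shaped like <...> or &...;
    static String strip(String html) {
        return html.replaceAll("\\<.*?\\>|&.*?;", "");
    }

    public static void main(String[] args) {
        String str = "<p><b>Send FWB <br><br> &#40;if AWB has COU SHC, <br> if ticked , will send FWB&#41;</b></p>";
        // Prints: After removing HTML Tags: Send FWB  if AWB has COU SHC,  if ticked , will send FWB
        System.out.println("After removing HTML Tags: " + strip(str));
    }
}
```

Note that each entity is removed together with its closing `;`, so the output does not end with a semicolon.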

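If you want to experiment with this regular expression in a sandbox, try regxr.com with `(\<.*?\>)|(&.*?;)`. The parentheses create two capturing groups, which makes the two alternatives easier to tell apart in the tool; they are not needed in your code. Note that in the sandbox the `\` does not need to be escaped, but in your code it must be written as `\\` because the pattern sits inside a Java string literal.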
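One caveat, since the input may vary: `&.*?;` is a loose pattern and also removes ordinary text that happens to sit between a stray `&` and the next `;`. If that matters for your data, a narrower pattern is one option. The sketch below is an assumption about what counts as an entity (a name, a decimal code, or a hex code), not part of the original answer, and the names `NarrowEntityStripper`, `TAG`, `ENTITY`, and `strip` are hypothetical.

```java
// Sketch only: a stricter variant that leaves a bare '&' in normal text alone.
class NarrowEntityStripper {

    // A tag: '<', any run of characters other than '>', then '>'.
    private static final String TAG = "<[^>]*>";

    // An entity: '&', then a name (amp), a decimal code (#40) or a hex code (#x28), then ';'.
    private static final String ENTITY = "&(?:[A-Za-z][A-Za-z0-9]*|#\\d+|#[xX][0-9A-Fa-f]+);";

    // Hypothetical helper: removes tags and well-formed entities, nothing else.
    static String strip(String html) {
        return html.replaceAll(TAG + "|" + ENTITY, "");
    }

    public static void main(String[] args) {
        // The loose pattern would delete "& Jerry;" here; this one keeps it.
        System.out.println(strip("Tom & Jerry; <b>friends</b> &amp; rivals"));
        System.out.println(strip("<p><b>Send FWB <br><br> &#40;if AWB has COU SHC, <br> if ticked , will send FWB&#41;</b></p>"));
    }
}
```

On the sample input from the question, this variant produces the same text as the looser pattern.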


huangapple
  • Posted on August 26, 2020, 14:58:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63592137.html