Replace ASCII codes and HTML tags in Java
Question
How can I achieve the expected result below without using `StringEscapeUtils`?
    public class Main {
        public static void main(String[] args) throws Exception {
            String str = "<p><b>Send FWB <br><br> &#40;if AWB has COU SHC, <br> if ticked , will send FWB&#41;</b></p>";
            str = str.replaceAll("\\<.*?\\>", "");
            System.out.println("After removing HTML Tags: " + str);
        }
    }
**Current result:**
After removing HTML Tags: Send FWB &#40;if AWB has COU SHC, if ticked , will send FWB&#41;
**Expected result:**
After removing HTML Tags: Send FWB if AWB has COU SHC, if ticked , will send FWB;
I have already checked:
https://stackoverflow.com/questions/994331/how-to-unescape-html-character-entities-in-java
**PS:** This is only a sample; the actual input may vary.
Answer 1
Score: 1
Your regex matches HTML tags such as <something>, but it does not match HTML entities. Entities follow a pattern like &.*?; and you are not replacing anything of that shape.
This should solve your problem:
    str = str.replaceAll("\\<.*?\\>|&.*?;", "");
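For reference, here is a minimal, self-contained sketch of the question's program with that one-line change applied. The class name `StripTagsAndEntities` and the helper method `strip` are made up for illustration; only the regex itself comes from this answer.

```java
public class StripTagsAndEntities {

    // Removes anything shaped like <...> (a tag) or &...; (an entity).
    // Hypothetical helper, not part of any library.
    static String strip(String html) {
        return html.replaceAll("\\<.*?\\>|&.*?;", "");
    }

    public static void main(String[] args) {
        String str = "<p><b>Send FWB <br><br> &#40;if AWB has COU SHC, <br> if ticked , will send FWB&#41;</b></p>";
        System.out.println("After removing HTML Tags: " + strip(str));
    }
}
```

The whitespace that surrounded the removed tags is left untouched, so the output still contains doubled spaces where `<br>` used to be.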
If you want to experiment with this in a sandbox, try regxr.com and use (\<.*?\>)|(&.*?;) as the pattern. The parentheses make the two capturing groups easy to identify in the tool; they are not needed in your code. Note that the \ does not need to be doubled in that sandbox, but it does in your code, because there the pattern sits inside a Java string literal.
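One caveat, since the input may vary: `&.*?;` also swallows ordinary text that merely contains an ampersand followed later by a semicolon. The sketch below is one possible tightening, not something from the original answer: it only treats `&#40;`-style numeric references and `&name;`-style named references as entities, and it additionally collapses the leftover whitespace. The class name, the constant and the helper are all hypothetical.

```java
public class NarrowEntityPattern {

    // Tag part is the answer's pattern; the entity part is an assumed, stricter
    // replacement: decimal (&#40;), hex (&#x28;) or named (&amp;) references only.
    static final String TAG_OR_ENTITY =
            "\\<.*?\\>|&(?:#\\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);";

    // Strips tags and entities, then collapses runs of whitespace to one space.
    static String strip(String html) {
        return html.replaceAll(TAG_OR_ENTITY, "")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    public static void main(String[] args) {
        String str = "<p><b>Send FWB <br><br> &#40;if AWB has COU SHC, <br> if ticked , will send FWB&#41;</b></p>";
        System.out.println("After removing HTML Tags: " + strip(str));

        // Plain text with '&' and ';' is left alone by the stricter pattern.
        System.out.println(strip("Tom & Jerry; done"));
    }
}
```

With the broader `&.*?;` the second input would lose `& Jerry;`, whereas here it passes through unchanged.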