Deleting parent tags without deleting children with Jsoup
Question
Sample HTML to clean up:
<div class="mrd3w m6et0 _2d49e_1O4vF">
 <div class="p1td4 pw4go p513t al2kje m10qy mij5n">
  <div class="_2d49e_2tor6" style="max-width:871px;max-height:552px">
   <div class="ptv8j2" style="padding-top:calc(100% * 552 / 871)">
    <img alt="alt" class="_2d49e_3B1Cq pt94f9 pt1itw ptux49 w1eai _2d49e_32cUf lazyloaded" sizes="(min-width: 1200px) 560px, (min-width: 992px) 50vw, 100vw" src="https://somelink.com 871w" width="871px">
   </div>
  </div>
 </div>
</div>
I have already removed some useless links and imports from this HTML, and this is my last problem. The class names of the divs are random, and there are a lot of them.
I need to get simple, clean markup like this:
<div>
 <img alt="alt" src="https://somelink.com">
</div>
I am creating an XML file from a database, and the description of each product is a mess that needs to be as clean as possible. The whole description is stored in the database as a single value, with all of these messy imports and tags. I am using Jsoup to rebuild the description, but I have no clue how to delete parents without deleting their children.
Answer 1
Score: 1
This requires two steps:
- To clean unwanted tags and attributes, use Whitelist and Jsoup.clean(html, whitelist).
- To remove a parent, use element.unwrap(). To remove repeating parents, move up in a loop and unwrap each parent whose own parent has the same tag name.
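To make the effect of unwrap() concrete before the full solution, here is a minimal sketch on a tiny input. The class and method names (UnwrapDemo, unwrapFirst) and the sample markup are made up for illustration; only selectFirst, unwrap, parseBodyFragment and body().html() are Jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class UnwrapDemo {

    // Hypothetical helper: unwraps the first element matching cssQuery
    // and returns the HTML inside <body>.
    static String unwrapFirst(String html, String cssQuery) {
        Document doc = Jsoup.parseBodyFragment(html);
        Element target = doc.selectFirst(cssQuery);
        if (target != null) {
            // Removes the element itself; its children take its place in the parent.
            target.unwrap();
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        // The <span> disappears, while its text and the <b> element stay inside the <div>.
        System.out.println(unwrapFirst("<div><span>Two <b>Three</b></span></div>", "span"));
    }
}
```

The point to notice is that unwrap() is the opposite of remove(): remove() would drop the span together with everything inside it.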
Here is the code to do this:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Whitelist;

public class JsoupIssue61137870 {

    public static void main(final String[] args) {
        String html = " <div class=\"mrd3w m6et0 _2d49e_1O4vF\"> \n"
                + " <div class=\"p1td4 pw4go p513t al2kje m10qy mij5n\"> \n"
                + " <div class=\"_2d49e_2tor6\" style=\"max-width:871px;max-height:552px\"> \n"
                + " <div class=\"ptv8j2\" style=\"padding-top:calc(100% * 552 / 871)\">\n"
                + " <img alt=\"alt\" class=\"_2d49e_3B1Cq pt94f9 pt1itw ptux49 w1eai _2d49e_32cUf lazyloaded\" sizes=\"(min-width: 1200px) 560px, (min-width: 992px) 50vw, 100vw\" src=\"https://somelink.com 871w\" width=\"871px\">\n"
                + " </div> \n" + " </div> \n" + " </div> \n" + " </div> ";

        // Step 1: keep only <div> and <img>, and only alt and src on <img>.
        // "alt" must be listed explicitly, otherwise the cleaner drops it.
        // Note: Whitelist was renamed Safelist in jsoup 1.14.1.
        Whitelist whitelist = Whitelist.none();
        whitelist.addTags("div", "img");
        whitelist.addAttributes("img", "alt", "src");
        String cleanHTML = Jsoup.clean(html, whitelist);
        System.out.println(cleanHTML);

        // Step 2: collapse the chain of nested <div> wrappers around the image.
        String result = removeRepeatingTags(cleanHTML);
        System.out.println(result);
    }

    private static String removeRepeatingTags(String html) {
        Document doc = Jsoup.parse(html);
        Element img = doc.selectFirst("img");
        if (img == null) {
            return doc.toString(); // no image, nothing to unwrap
        }
        Element parent = img.parent();
        // While parent and grandparent share a tag name, drop the parent but keep its children.
        while (parent.parent() != null && parent.tagName().equals(parent.parent().tagName())) {
            parent.unwrap();
            parent = img.parent();
        }
        return doc.toString();
    }
}
The output of the first part is:
<div>
 <div>
  <div>
   <div>
    <img alt="alt" src="https://somelink.com 871w">
   </div>
  </div>
 </div>
</div>
and the output after the second part is:
<html>
 <head></head>
 <body>
  <div>
   <img alt="alt" src="https://somelink.com 871w">
  </div>
 </body>
</html>
Jsoup will add the <html>, <head>, and <body> tags. To avoid this, instead of
Document doc = Jsoup.parse(html);
use
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
(Parser is org.jsoup.parser.Parser) and the output will be exactly what you expect:
<div>
 <img alt="alt" src="https://somelink.com 871w">
</div>
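If a description contains more than one image, the same idea can be sketched a little more generally. This is only a sketch under assumptions that are not part of the answer above: the names WrapperCollapser and collapseWrappers are made up for illustration, the input is assumed to be already cleaned, and a <div> is treated as redundant only when its parent is also a <div> with no other element children. It parses with Jsoup.parseBodyFragment and returns doc.body().html(), which is another way to keep <html>, <head> and <body> out of the result.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class WrapperCollapser {

    // Hypothetical helper: unwraps every <div> that is the only element child
    // of another <div>, then returns just the HTML inside <body>.
    static String collapseWrappers(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Element div : doc.body().select("div > div")) {
                Element parent = div.parent();
                if (parent != null && parent.children().size() == 1) {
                    div.unwrap();   // drop this wrapper, keep its children
                    changed = true;
                    break;          // the tree changed, so select again
                }
            }
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<div><div><div><img alt=\"alt\" src=\"https://somelink.com\"></div></div></div>"
                + "<div><div><img alt=\"second\" src=\"https://somelink.com\"></div></div>";
        // Each image should end up inside a single <div>.
        System.out.println(collapseWrappers(html));
    }
}
```

A <div> that has sibling elements is left alone, on the assumption that such a wrapper may group real content. Each unwrap removes one element, so the loop always terminates.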