Deleting parent tags without deleting children with Jsoup

Question

Sample code to remake:

    <div class="mrd3w m6et0 _2d49e_1O4vF">
      <div class="p1td4 pw4go p513t al2kje m10qy mij5n">
        <div class="_2d49e_2tor6" style="max-width:871px;max-height:552px">
          <div class="ptv8j2" style="padding-top:calc(100% * 552 / 871)">
            <img alt="alt" class="_2d49e_3B1Cq pt94f9 pt1itw ptux49 w1eai _2d49e_32cUf lazyloaded" sizes="(min-width: 1200px) 560px, (min-width: 992px) 50vw, 100vw" src="https://somelink.com 871w" width="871px">
          </div>
        </div>
      </div>
    </div>

I have already deleted some useless links and imports from this HTML, and this is my last problem. The div classes are random, and there are a lot of them.

I need to get simple clean code like this:

    <div>
      <img alt="alt" src="https://somelink.com">
    </div>

I am creating an XML file from a database, and the description of each product is a mess that needs to be as clean as possible. The whole description is stored in the database as a single value, with all these messy imports and tags. I am using Jsoup to rework the description, but I have no clue how to delete parents without deleting their children.

Answer 1

Score: 1

This requires two steps:

  1. To strip unwanted tags and attributes, use a Whitelist with Jsoup.clean(html, whitelist).
  2. To remove a parent while keeping its children, use element.unwrap(). To remove repeated parents, walk up from the element in a loop and unwrap each parent whose tag matches its own parent's tag.
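
To make step 2 concrete, here is a minimal sketch of what an unwrap loop does to the tree. It deliberately uses the JDK's built-in org.w3c.dom API rather than jsoup, so it runs without extra dependencies; the class name UnwrapSketch and the method names are made up for illustration, and the input is assumed to be well-formed XML (note the self-closed img).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration class; uses the JDK DOM API, not jsoup.
final class UnwrapSketch {

    // Replace `wrapper` with its own children, preserving their order.
    static void unwrap(Node wrapper) {
        Node parent = wrapper.getParentNode();
        while (wrapper.getFirstChild() != null) {
            // insertBefore moves the child out of `wrapper` and into `parent`.
            parent.insertBefore(wrapper.getFirstChild(), wrapper);
        }
        parent.removeChild(wrapper);
    }

    // Collapse same-named wrappers around the first <img>, keeping the outermost one.
    static Document collapseAroundFirstImg(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Node img = doc.getElementsByTagName("img").item(0);
        if (img == null) {
            return doc; // nothing to unwrap
        }
        Node parent = img.getParentNode();
        // Stop once the parent's own parent has a different name (or is missing).
        while (parent.getParentNode() != null
                && parent.getNodeName().equals(parent.getParentNode().getNodeName())) {
            unwrap(parent);
            parent = img.getParentNode();
        }
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = collapseAroundFirstImg(
                "<div><div><div><div><img alt=\"alt\" src=\"https://somelink.com\"/></div></div></div></div>");
        System.out.println(doc.getElementsByTagName("div").getLength()); // prints 1
    }
}
```

The loop has the same shape as the jsoup version: it always re-reads the image's current parent after each unwrap, so it keeps going until only the outermost wrapper is left.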

Here is the full Jsoup code for both steps:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.safety.Whitelist; // renamed to Safelist in jsoup 1.14.2

    public class JsoupIssue61137870 {

        public static void main(final String[] args) {
            String html = " <div class=\"mrd3w m6et0 _2d49e_1O4vF\"> \n"
                    + " <div class=\"p1td4 pw4go p513t al2kje m10qy mij5n\"> \n"
                    + " <div class=\"_2d49e_2tor6\" style=\"max-width:871px;max-height:552px\"> \n"
                    + " <div class=\"ptv8j2\" style=\"padding-top:calc(100% * 552 / 871)\">\n"
                    + " <img alt=\"alt\" class=\"_2d49e_3B1Cq pt94f9 pt1itw ptux49 w1eai _2d49e_32cUf lazyloaded\" sizes=\"(min-width: 1200px) 560px, (min-width: 992px) 50vw, 100vw\" src=\"https://somelink.com 871w\" width=\"871px\">\n"
                    + " </div> \n" + " </div> \n" + " </div> \n" + " </div> ";

            // Step 1: keep only <div> and <img>, and on <img> only alt and src.
            // (alt must be listed too, otherwise the cleaner strips it.)
            Whitelist whitelist = Whitelist.none();
            whitelist.addTags("div", "img");
            whitelist.addAttributes("img", "alt", "src");
            String cleanHTML = Jsoup.clean(html, whitelist);
            System.out.println(cleanHTML);

            // Step 2: collapse the nested wrapper divs.
            String result = removeRepeatingTags(cleanHTML);
            System.out.println(result);
        }

        private static String removeRepeatingTags(String html) {
            Document doc = Jsoup.parse(html);
            Element img = doc.selectFirst("img");
            if (img == null) {
                return doc.toString(); // nothing to unwrap
            }
            Element parent = img.parent();
            // While parent and grandparent share a tag name, drop the parent
            // but keep its children in place.
            while (parent.parent() != null
                    && parent.tagName().equals(parent.parent().tagName())) {
                parent.unwrap();
                parent = img.parent();
            }
            return doc.toString();
        }
    }

The output of the first part is:

    <div>
     <div>
      <div>
       <div>
        <img alt="alt" src="https://somelink.com 871w">
       </div>
      </div>
     </div>
    </div>

and the output after the second part is:

    <html>
     <head></head>
     <body>
      <div>
       <img alt="alt" src="https://somelink.com 871w">
      </div>
     </body>
    </html>

Jsoup adds <html>, <head> and <body> tags because it parses the fragment as a full HTML document. To avoid this, instead of

    Document doc = Jsoup.parse(html);

use (Parser here is org.jsoup.parser.Parser)

    Document doc = Jsoup.parse(html, "", Parser.xmlParser());

and the output will be exactly what you expect:

    <div>
     <img alt="alt" src="https://somelink.com 871w">
    </div>
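
One difference from the output requested in the question remains: the src value still ends in " 871w", which looks like a srcset-style width descriptor. A minimal sketch of stripping it is shown below. It is plain Java with no Jsoup dependency, the class and method names are hypothetical, and it assumes the value is a single URL optionally followed by whitespace and a descriptor (a real comma-separated srcset would need more handling).

```java
// Hypothetical helper, not part of jsoup.
final class SrcDescriptorSketch {

    // Return the first whitespace-separated token of an attribute value,
    // e.g. the URL in "https://somelink.com 871w".
    static String firstUrl(String value) {
        String trimmed = value.trim();
        if (trimmed.isEmpty()) {
            return trimmed;
        }
        return trimmed.split("\\s+")[0];
    }

    public static void main(String[] args) {
        System.out.println(firstUrl("https://somelink.com 871w")); // prints https://somelink.com
    }
}
```

With Jsoup, this could presumably be applied to the element before serializing, along the lines of img.attr("src", firstUrl(img.attr("src"))).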

huangapple
  • Posted on April 10, 2020 at 17:56:49
  • When reposting, please keep the link to this article: https://go.coder-hub.com/61137870.html