Deleting parent tags without deleting children with Jsoup

Question

Sample HTML to rework:

       <div class="mrd3w m6et0 _2d49e_1O4vF">
        <div class="p1td4 pw4go p513t al2kje m10qy mij5n">
         <div class="_2d49e_2tor6" style="max-width:871px;max-height:552px">
          <div class="ptv8j2" style="padding-top:calc(100% * 552 / 871)">
           <img alt="alt" class="_2d49e_3B1Cq pt94f9 pt1itw ptux49 w1eai _2d49e_32cUf lazyloaded" sizes="(min-width: 1200px) 560px, (min-width: 992px) 50vw, 100vw" src="https://somelink.com 871w" width="871px">
          </div>
         </div>
        </div>
       </div>

I have already deleted some useless links and imports from this HTML, and this is my last problem. The div classes are random, and there are a lot of them.

I need to get simple, clean code like this:

<div>
  <img alt="alt" src="https://somelink.com">
</div>

I am creating an XML file from a database, and the description of each product is a mess that needs to be as clean as possible. The whole description is stored in the database as a single value, with all these messy imports and tags. I am using Jsoup to rework the description, but I have no clue how to delete parents without deleting their children.

Answer 1

Score: 1

This requires two steps:

  1. To strip unwanted tags and attributes, use Whitelist with Jsoup.clean(html, whitelist). Newer jsoup releases name this class Safelist.
  2. To remove a parent while keeping its children, use element.unwrap(). To remove repeated parents, walk up from the child in a loop and unwrap each parent whose tag name matches that of its own parent.

Here is the code to do this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Whitelist;

public class JsoupIssue61137870 {

    public static void main(final String[] args) {
        String html = "  <div class=\"mrd3w m6et0 _2d49e_1O4vF\"> \n"
                + "        <div class=\"p1td4 pw4go p513t al2kje m10qy mij5n\"> \n"
                + "         <div class=\"_2d49e_2tor6\" style=\"max-width:871px;max-height:552px\"> \n"
                + "          <div class=\"ptv8j2\" style=\"padding-top:calc(100% * 552 / 871)\">\n"
                + "           <img alt=\"alt\" class=\"_2d49e_3B1Cq pt94f9 pt1itw ptux49 w1eai _2d49e_32cUf lazyloaded\" sizes=\"(min-width: 1200px) 560px, (min-width: 992px) 50vw, 100vw\" src=\"https://somelink.com 871w\" width=\"871px\">\n"
                + "          </div> \n" + "         </div> \n" + "        </div> \n" + "       </div> ";

        // Step 1: keep only <div> and <img>, and only alt and src on <img>.
        // Newer jsoup releases name this class Safelist instead of Whitelist.
        Whitelist whitelist = Whitelist.none();
        whitelist.addTags("div", "img");
        whitelist.addAttributes("img", "alt", "src");
        String cleanHTML = Jsoup.clean(html, whitelist);
        System.out.println(cleanHTML);

        // Step 2: collapse the repeated <div> wrappers around the image.
        String result = removeRepeatingTags(cleanHTML);
        System.out.println(result);
    }

    // Unwraps the parent of the first <img> for as long as that parent
    // sits inside another element with the same tag name.
    private static String removeRepeatingTags(String html) {
        Document doc = Jsoup.parse(html);
        Element img = doc.selectFirst("img");
        if (img == null) {
            return doc.toString(); // no image, nothing to unwrap
        }
        Element parent = img.parent();
        while (parent.parent() != null && parent.tagName().equals(parent.parent().tagName())) {
            parent.unwrap(); // removes the parent but keeps its children in place
            parent = img.parent();
        }
        return doc.toString();
    }
}

The output of the first part is:

<div>
 <div>
  <div>
   <div>
    <img alt="alt" src="https://somelink.com 871w">
   </div>
  </div>
 </div>
</div>

and the output after the second part is:

<html>
 <head></head>
 <body>
  <div>
    <img alt="alt" src="https://somelink.com 871w">
  </div>
 </body>
</html>

Jsoup adds the <html>, <head> and <body> tags because Jsoup.parse(html) builds a full HTML document. To avoid this, instead of

    Document doc = Jsoup.parse(html);

use the XML parser (Parser is org.jsoup.parser.Parser):

    Document doc = Jsoup.parse(html, "", Parser.xmlParser());

and the output will be just the fragment you expect:

<div>
 <img alt="alt" src="https://somelink.com 871w">
</div>
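
The helper above only follows the first img, while the question mentions many such divs. As a rough sketch of how the same unwrap() idea could be applied to every redundant wrapper, the block below cleans the HTML and then unwraps each div whose parent div contains nothing else. The class name WrapperCollapser and the method cleanAndCollapse are hypothetical, not from the answer, and the sketch assumes a recent jsoup release in which Safelist replaces Whitelist. It also returns doc.body().html() rather than switching to the XML parser, which is another way to get only the fragment.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

// Hypothetical helper class for illustration; not part of the original answer.
class WrapperCollapser {

    // Keeps only <div> and <img alt src>, then collapses chains of <div>
    // elements in which a <div> contains nothing but one other <div>.
    static String cleanAndCollapse(String html) {
        // Assumption: a jsoup version that ships Safelist (formerly Whitelist).
        Safelist safelist = Safelist.none()
                .addTags("div", "img")
                .addAttributes("img", "alt", "src");
        String cleaned = Jsoup.clean(html, safelist);

        // parseBodyFragment still builds an <html>/<body> shell internally,
        // but body().html() at the end returns only what is inside <body>.
        Document doc = Jsoup.parseBodyFragment(cleaned);

        // select() returns a snapshot list in document order, so elements
        // can be unwrapped while iterating over it.
        for (Element div : doc.select("div")) {
            Element parent = div.parent();
            boolean parentOnlyWrapsThisDiv = parent != null
                    && parent.tagName().equals("div")
                    && parent.children().size() == 1
                    && parent.ownText().trim().isEmpty();
            if (parentOnlyWrapsThisDiv) {
                div.unwrap(); // removes this div, keeps its children in place
            }
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<div class=\"a\"><div class=\"b\" style=\"max-width:871px\">"
                + "<div class=\"c\"><img alt=\"alt\" class=\"d\" "
                + "src=\"https://somelink.com\" width=\"871px\"></div>"
                + "</div></div>";
        System.out.println(cleanAndCollapse(html));
    }
}
```

A div that has several element children, or text of its own, is left alone, so sibling blocks inside one wrapper are not merged. Whether that rule is strict enough depends on the real product descriptions, so it is worth checking against a few of them before running it over the whole database.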

huangapple
  • Posted on April 10, 2020, 17:56:49
  • When reposting, please keep the link to this article: https://go.coder-hub.com/61137870.html