Merging same elements in JSoup

Question

I have an HTML string like this:

<b>test</b><b>er</b>
<span class="ab">continue</span><span> without</span>

I want to collapse adjacent tags that are the same and belong together. For the sample above I want to get

<b>tester</b>

since both elements have the same tag name without any further attribute or style. The span tags, however, should remain unchanged, because one of them has a class attribute. I am aware that I can iterate over the tree with Jsoup:

Document doc = Jsoup.parse(input);
for (Element element : doc.select("b")) {
}

But it is not clear to me how to look ahead (I guess with something like nextSibling), and then how to collapse the elements.

Or is there a simple regex-based merge?

I can specify the attributes myself. A one-size-fits-all solution for every tag is not required.
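
On the regex idea, here is a minimal sketch using only java.util.regex. It assumes the elements to merge carry no attributes and contain plain text only (no nested tags). The class name RegexMergeSketch and the method mergeAdjacent are made up for this illustration and are not part of Jsoup.

```java
import java.util.regex.Pattern;

class RegexMergeSketch {

    // Hypothetical helper: merges runs of directly adjacent, attribute-less
    // <tag>text</tag> elements whose content is plain text.
    static String mergeAdjacent(String html, String tag) {
        String t = Pattern.quote(tag);
        // Group 1 is an attribute-less opening tag followed by markup-free text.
        // It must be followed directly by "</tag><tag>", which is dropped.
        Pattern boundary = Pattern.compile("(<" + t + ">[^<]*)</" + t + "><" + t + ">");
        String previous;
        String current = html;
        // One pass merges pairs only, so repeat until nothing changes
        // in order to collapse runs of three or more elements.
        do {
            previous = current;
            current = boundary.matcher(previous).replaceAll("$1");
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        String input = "<b>test</b><b>er</b><span class=\"ab\">continue</span><span> without</span>";
        // prints <b>tester</b><span class="ab">continue</span><span> without</span>
        System.out.println(mergeAdjacent(mergeAdjacent(input, "b"), "span"));
    }
}
```

The span elements stay untouched because the pattern requires a bare `<span>` opening tag in front of the text. Regular expressions do not understand HTML structure, so anything beyond this narrow case (nested markup, attributes, comments) is probably better handled on the parsed tree.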

Answer 1

Score: 1

My approach would be like this (see the comments in the code):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;

public class StackOverflow60704600 {

	public static void main(final String[] args) {
		Document doc = Jsoup.parse("<b>test</b><b>er</b><span class=\"ab\">continue</span><span> without</span>");
		mergeSiblings(doc, "b");
		System.out.println(doc);
	}

	private static void mergeSiblings(Document doc, String selector) {
		Elements elements = doc.select(selector);
		for (Element element : elements) {
			// get the next sibling
			Element nextSibling = element.nextElementSibling();
			// merge only if the next sibling has the same tag name and the same set of attributes
			if (nextSibling != null && nextSibling.tagName().equals(element.tagName())
					&& nextSibling.attributes().equals(element.attributes())) {
				// your element has only one child, but let's move all of them in case there are more
				while (nextSibling.childNodes().size() > 0) {
					Node siblingChildNode = nextSibling.childNodes().get(0);
					element.appendChild(siblingChildNode);
				}
				// remove the sibling because now it doesn't have any children
				nextSibling.remove();
			}
		}
	}
}

Output:

<html>
 <head></head>
 <body>
  <b>tester</b>
  <span class="ab">continue</span>
  <span> without</span>
 </body>
</html>

One more note on why I used the loop while (nextSibling.childNodes().size() > 0): a for loop or an iterator can't be used here, because appendChild adds the child to the target element but also removes it from the source element, so the remaining children are shifted. It may not be visible here, but the problem will appear when you try to merge <b>test</b><b>er<a>123</a></b>.
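
The shifting effect described above can be reproduced without Jsoup. The sketch below is only an analogy: it uses plain java.util lists as stand-ins for the child node lists, on the assumption that moving a node behaves like removing it from one list and adding it to another. The class and method names are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MoveChildrenSketch {

    // Index-based loop: each move shifts the remaining entries to the left,
    // so the entry that slides into the current index is skipped.
    static List<String> moveWithIndexLoop(List<String> source, List<String> target) {
        for (int i = 0; i < source.size(); i++) {
            target.add(source.remove(i));
        }
        return target;
    }

    // Drain loop, same shape as in the answer: always take the first
    // remaining entry until the source is empty.
    static List<String> moveWithDrainLoop(List<String> source, List<String> target) {
        while (!source.isEmpty()) {
            target.add(source.remove(0));
        }
        return target;
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("er", "<a>123</a>");

        // prints [test, er] because "<a>123</a>" is left behind
        System.out.println(moveWithIndexLoop(
                new ArrayList<>(children), new ArrayList<>(Arrays.asList("test"))));

        // prints [test, er, <a>123</a>]
        System.out.println(moveWithDrainLoop(
                new ArrayList<>(children), new ArrayList<>(Arrays.asList("test"))));
    }
}
```

With plain lists, a for-each loop over the source would fail in a different way, typically with a ConcurrentModificationException, which is another reason to drain from the front.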

Answer 2

Score: 1

I tried to update the code from @Krystian G, but my edit was rejected :-/ so I am posting it as a separate answer. The code is an excellent starting point, but it fails if a TextNode appears between the tags, e.g.

`<span> no class but further</span> (in)valid <span>spanning</span>` would result in

`<span> no class but furtherspanning</span> (in)valid `

Therefore the corrected code looks like:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;
    import org.jsoup.nodes.TextNode;
    import org.jsoup.select.Elements;

    public class StackOverflow60704600 {

        public static void main(final String[] args) {
            String test1 = "<b>test</b><b>er</b><span class=\"ab\">continue</span><span> without</span>";
            String test2 = "<b>test</b><b>er<a>123</a></b>";
            String test3 = "<span> no class but further</span>   <span>spanning</span>";
            String test4 = "<span> no class but further</span> (in)valid <span>spanning</span>";
            Document doc = Jsoup.parse(test1);
            mergeSiblings(doc, "b");
            System.out.println(doc);
        }

        private static void mergeSiblings(Document doc, String selector) {
            Elements elements = doc.select(selector);
            for (Element element : elements) {
                Node nextNode = element.nextSibling();
                // if the next node is a TextNode that contains only whitespace
                // ==> we need to preserve the spacing
                boolean addSpace = false;
                if (nextNode instanceof TextNode) {
                    if (!((TextNode) nextNode).isBlank()) {
                        // the text between the elements has real content, so do not merge
                        continue;
                    }
                    addSpace = true;
                }
                // get the next sibling
                Element nextSibling = element.nextElementSibling();
                // merge only if the next sibling has the same tag name and the same set of
                // attributes
                if (nextSibling != null && nextSibling.tagName().equals(element.tagName())
                        && nextSibling.attributes().equals(element.attributes())) {
                    // your element has only one child, but let's move all of them in case there are more
                    while (nextSibling.childNodes().size() > 0) {
                        Node siblingChildNode = nextSibling.childNodes().get(0);
                        if (addSpace) {
                            // there was whitespace between the elements ==> preserve it once,
                            // in front of the first moved child
                            if (siblingChildNode instanceof TextNode) {
                                TextNode textNode = (TextNode) siblingChildNode;
                                // getWholeText() returns the raw text; toString() would return escaped HTML
                                textNode.text(" " + textNode.getWholeText());
                            } else {
                                element.appendChild(new TextNode(" "));
                            }
                            addSpace = false;
                        }
                        element.appendChild(siblingChildNode);
                    }
                    // remove the sibling because now it doesn't have any children
                    nextSibling.remove();
                }
            }
        }
    }

huangapple
  • Posted on March 16, 2020, 18:53:18
  • Please retain the link to this article when reposting: https://go.coder-hub.com/60704600.html