Merging same elements in JSoup

Question

I have HTML strings like these:

    <b>test</b><b>er</b>
    <span class="ab">continue</span><span> without</span>

I want to collapse adjacent tags that are similar and belong together. For the sample above, I want to get

    <b>tester</b>

since both elements use the same tag without any further attributes or styles. The span tags, however, should remain as they are, because one of them has a class attribute. I am aware that I can iterate over the tree with Jsoup:

    Document doc = Jsoup.parse(input);
    for (Element element : doc.select("b")) {
    }

But I'm not clear on how to look ahead (I guess with something like nextSibling), and then how to collapse the elements.

Or is there a simple regex-based way to merge them?

I can specify the attributes myself. A one-size-fits-all solution for every tag is not required.

Answer 1

Score: 1

My approach would be like this; see the comments in the code:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;
    import org.jsoup.select.Elements;

    public class StackOverflow60704600 {

        public static void main(final String[] args) {
            Document doc = Jsoup.parse("<b>test</b><b>er</b><span class=\"ab\">continue</span><span> without</span>");
            mergeSiblings(doc, "b");
            System.out.println(doc);
        }

        private static void mergeSiblings(Document doc, String selector) {
            Elements elements = doc.select(selector);
            for (Element element : elements) {
                // get the next sibling
                Element nextSibling = element.nextElementSibling();
                // merge only if the next sibling has the same tag name and the same set of attributes
                if (nextSibling != null && nextSibling.tagName().equals(element.tagName())
                        && nextSibling.attributes().equals(element.attributes())) {
                    // your element has only one child, but let's move all of them if there are more
                    while (nextSibling.childNodes().size() > 0) {
                        Node siblingChildNode = nextSibling.childNodes().get(0);
                        element.appendChild(siblingChildNode);
                    }
                    // remove it, because now it doesn't have any children
                    nextSibling.remove();
                }
            }
        }
    }

Output:

    <html>
    <head></head>
    <body>
    <b>tester</b>
    <span class="ab">continue</span>
    <span> without</span>
    </body>
    </html>
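
One limitation worth noting: each element is compared with its immediate successor only once, so a run of three or more identical siblings such as <b>a</b><b>b</b><b>c</b> would probably end up only partly merged. Below is a minimal sketch of a variant that keeps absorbing matching siblings. It assumes the same jsoup calls as the code above; the class name MergeRuns and the method mergeRuns are made up for illustration, and text nodes between the elements are still ignored.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MergeRuns {

    // Hypothetical helper: merges every run of adjacent, identical elements
    // (same tag name, same attributes) into the first element of the run.
    static void mergeRuns(Document doc, String selector) {
        for (Element element : doc.select(selector)) {
            // Elements that were already merged away are detached; skip them.
            if (element.parent() == null) {
                continue;
            }
            Element next = element.nextElementSibling();
            while (next != null
                    && next.tagName().equals(element.tagName())
                    && next.attributes().equals(element.attributes())) {
                // appendChild moves the node, so always take the first remaining child.
                while (next.childNodeSize() > 0) {
                    element.appendChild(next.childNode(0));
                }
                next.remove();
                // Look again: the following sibling may belong to the same run.
                next = element.nextElementSibling();
            }
        }
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<b>a</b><b>b</b><b>c</b>");
        mergeRuns(doc, "b");
        System.out.println(doc.body().html());
    }
}
```

The check on element.parent() is a design choice: doc.select returns a snapshot, so elements removed during an earlier iteration are still in the list and have to be skipped.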

One more note on why I used the loop while (nextSibling.childNodes().size() > 0): it turned out that a for loop or an iterator couldn't be used here, because appendChild adds the child to the target but also removes it from the source element, so the remaining children are shifted. It may not be visible here, but the problem appears when you try to merge <b>test</b><b>er<a>123</a></b>.
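
The shifting effect described in this note can be reproduced without jsoup. The sketch below uses plain java.util lists as a stand-in for an element's child list, so it is only an analogy and not jsoup's actual implementation; the class and method names are invented for the demo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MoveChildrenDemo {

    // Index-based loop: every removal shifts the remaining items to the left,
    // so advancing the index skips items and leaves them behind in the source.
    static List<String> moveWithIndexLoop(List<String> source) {
        List<String> target = new ArrayList<>();
        for (int i = 0; i < source.size(); i++) {
            target.add(source.remove(i));
        }
        return target;
    }

    // Drain loop: always take the first remaining item until the source is empty.
    static List<String> moveWithDrainLoop(List<String> source) {
        List<String> target = new ArrayList<>();
        while (!source.isEmpty()) {
            target.add(source.remove(0));
        }
        return target;
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("er", "<a>123</a>");
        System.out.println(moveWithIndexLoop(new ArrayList<>(children)));
        System.out.println(moveWithDrainLoop(new ArrayList<>(children)));
    }
}
```

With two children, the index loop moves only the first one, while the drain loop moves both. This is the same shape as the while loop in the answer.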

Answer 2

Score: 1

I tried to update the code from @Krystian G, but my edit was rejected :-/ so I am posting it as a separate answer. The code is an excellent starting point, but it fails if a TextNode appears between the tags. For example,

    <span> no class but further</span> (in)valid <span>spanning</span>

would result in

    <span> no class but furtherspanning</span> (in)valid
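
The reason is that nextElementSibling() skips over text nodes, whereas nextSibling() returns whatever node comes next. A small sketch to inspect what directly follows an element; the class SiblingLookahead and the method describeNext are hypothetical names, and it assumes a jsoup version that provides selectFirst:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class SiblingLookahead {

    // Hypothetical helper: classifies the node that directly follows
    // the first element matching the selector.
    static String describeNext(String html, String selector) {
        Element first = Jsoup.parse(html).selectFirst(selector);
        Node next = first.nextSibling();
        if (next == null) {
            return "nothing";
        }
        if (next instanceof TextNode) {
            // isBlank() is true when the text consists of whitespace only
            return ((TextNode) next).isBlank() ? "blank text" : "text";
        }
        return next instanceof Element ? "element" : "other node";
    }

    public static void main(String[] args) {
        System.out.println(describeNext("<span>a</span> (in)valid <span>b</span>", "span"));
        System.out.println(describeNext("<span>a</span> <span>b</span>", "span"));
        System.out.println(describeNext("<span>a</span><span>b</span>", "span"));
    }
}
```

Only the "element" and "blank text" cases are candidates for merging; real text between the two elements means they should be left alone.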

Therefore the corrected code looks like:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;
    import org.jsoup.nodes.TextNode;
    import org.jsoup.select.Elements;

    public class StackOverflow60704600 {

        public static void main(final String[] args) {
            String test1 = "<b>test</b><b>er</b><span class=\"ab\">continue</span><span> without</span>";
            String test2 = "<b>test</b><b>er<a>123</a></b>";
            String test3 = "<span> no class but further</span> <span>spanning</span>";
            String test4 = "<span> no class but further</span> (in)valid <span>spanning</span>";
            Document doc = Jsoup.parse(test1);
            mergeSiblings(doc, "b");
            System.out.println(doc);
        }

        private static void mergeSiblings(Document doc, String selector) {
            Elements elements = doc.select(selector);
            for (Element element : elements) {
                Node nextNode = element.nextSibling();
                // if the next node is a TextNode that contains only whitespace
                // ==> we need to preserve the spacing
                boolean addSpace = false;
                if (nextNode instanceof TextNode) {
                    // TextNode.isBlank() checks the raw text (no Java 11 String.isBlank needed)
                    if (!((TextNode) nextNode).isBlank()) {
                        // the next node has some content, so do not merge across it
                        continue;
                    }
                    addSpace = true;
                }
                // get the next sibling
                Element nextSibling = element.nextElementSibling();
                // merge only if the next sibling has the same tag name and the same set of
                // attributes
                if (nextSibling != null && nextSibling.tagName().equals(element.tagName())
                        && nextSibling.attributes().equals(element.attributes())) {
                    // your element has only one child, but let's move all of them if there are more
                    while (nextSibling.childNodes().size() > 0) {
                        Node siblingChildNode = nextSibling.childNodes().get(0);
                        if (addSpace) {
                            // since we had some space previously ==> preserve it, but only once
                            if (siblingChildNode instanceof TextNode) {
                                TextNode text = (TextNode) siblingChildNode;
                                // getWholeText() returns the unescaped text, unlike toString()
                                text.text(" " + text.getWholeText());
                            } else {
                                element.appendChild(new TextNode(" "));
                            }
                            addSpace = false;
                        }
                        element.appendChild(siblingChildNode);
                    }
                    // remove it, because now it doesn't have any children
                    nextSibling.remove();
                }
            }
        }
    }

Posted by huangapple on March 16, 2020, 18:53:18
Link to this article: https://go.coder-hub.com/60704600.html