How to get all p tags between and after h2 tags in Jsoup

Question

I have HTML like this:

  <h2 id="17273">bla bla bla 1</h2>
  <p>Text i need</p>
  <p>Text i need</p>
  <p>Text i need</p>
  <h2 id="45626">bla bla bla 2</h2>
  <p>Text i need</p>
  <p>Text i need</p>
  <p>Text i need</p>
  <h2 id="78519">bla bla bla 3</h2>
  <p>Text i need</p>
  <p>Text i need</p>
  <h2 id="72725">bla bla bla 2</h2>
  <p>Text i need</p>
  <p>Text i need</p>

I want to extract all the p tags that follow each h2 tag (up to the next h2) and map them to that h2, like this:

  [(h2 with id 17273 = all p tags below it), (h2 with id 45626 = all p tags below it)]

To be honest, I don't know how to achieve this. I've tried a few things, such as doc.siblingElements(), but none of them produced a result like that.

Answer 1

Score: 0

Since the <h2> and <p> tags are siblings and are not linked in any other way, you can use regex to artificially create a parent-child relationship between them by wrapping each <h2> and the <p> tags that follow it in a made-up <parent> element:

  // html is your HTML String
  String x = html
          // close the current group right before the next <h2> (needs whitespace between </p> and <h2)
          .replaceAll("</p>\\s+<h2", "</p></parent>\n<h2")
          // open a new group at every <h2>
          .replaceAll("<h2", "<parent><h2")
          // close the last group
          + "</parent>";
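To make the effect of the two replacements visible, here is a minimal sketch that uses only the Java standard library. The class name `WrapDemo`, the method name `wrapSections`, and the shortened sample input are illustrative assumptions and are not part of the answer itself.

```java
public class WrapDemo {
    // Wraps every <h2> and the <p> tags that follow it in an artificial <parent> element.
    static String wrapSections(String html) {
        return html
                // end the current group just before the next <h2>
                .replaceAll("</p>\\s+<h2", "</p></parent>\n<h2")
                // start a new group at every <h2>
                .replaceAll("<h2", "<parent><h2")
                // end the last group
                + "</parent>";
    }

    public static void main(String[] args) {
        String html = "<h2 id=\"17273\">bla bla bla 1</h2>\n"
                + "<p>Text i need</p>\n"
                + "<h2 id=\"45626\">bla bla bla 2</h2>\n"
                + "<p>Text i need</p>";
        // Prints the HTML with each h2 section wrapped in <parent>...</parent>
        System.out.println(wrapSections(html));
    }
}
```

Note that the first pattern uses `\\s+`, so it only matches when there is at least one whitespace character between `</p>` and the next `<h2`. If the HTML may be minified, `\\s*` is probably the safer choice.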

Then using Jsoup is relatively simple:

  import java.util.ArrayList;
  import java.util.List;
  import org.jsoup.Jsoup;
  import org.jsoup.nodes.Document;
  import org.jsoup.nodes.Element;
  import org.jsoup.select.Elements;

  Document doc = Jsoup.parse(x);
  // each artificial <parent> holds one <h2> and its <p> tags
  Elements parents = doc.getElementsByTag("parent");
  for (Element e : parents) {
      // only the <h2> carries an id attribute here
      Elements h2 = e.getElementsByAttribute("id");
      String id = h2.attr("id");
      Elements pElements = e.getElementsByTag("p");
      List<String> pList = new ArrayList<>();
      for (Element p : pElements)
          pList.add(p.text());
      System.out.println("h2 with id " + id + " = " + pList);
  }

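The output received (the leading 1-4 in each text is not in the HTML shown in the question; the input used for this run apparently numbered the paragraphs by section):

  h2 with id 17273 = [1 Text i need, 1 Text i need, 1 Text i need]
  h2 with id 45626 = [2 Text i need, 2 Text i need, 2 Text i need]
  h2 with id 78519 = [3 Text i need, 3 Text i need]
  h2 with id 72725 = [4 Text i need, 4 Text i need]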
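As an alternative that avoids rewriting the HTML with regex, the same grouping can probably be done by walking each h2's following siblings with Jsoup's `nextElementSibling()`. This is a different technique from the regex approach, shown only as a sketch. It assumes the h2 and p tags are direct siblings under one parent, as in the question, and the names `H2Grouper` and `groupByH2` are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class H2Grouper {
    // Maps each <h2> id to the texts of the <p> siblings that follow it,
    // stopping at the next <h2>. Assumes <h2> and <p> share one parent.
    static Map<String, List<String>> groupByH2(String html) {
        Document doc = Jsoup.parse(html);
        // LinkedHashMap keeps the headings in document order
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Element h2 : doc.select("h2")) {
            List<String> paragraphs = new ArrayList<>();
            Element next = h2.nextElementSibling();
            while (next != null && !next.tagName().equals("h2")) {
                if (next.tagName().equals("p")) {
                    paragraphs.add(next.text());
                }
                next = next.nextElementSibling();
            }
            result.put(h2.id(), paragraphs);
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<h2 id=\"17273\">bla bla bla 1</h2>"
                + "<p>Text i need</p><p>Text i need</p>"
                + "<h2 id=\"45626\">bla bla bla 2</h2>"
                + "<p>Text i need</p>";
        groupByH2(html).forEach((id, texts) ->
                System.out.println("h2 with id " + id + " = " + texts));
    }
}
```

Because it works on the parsed tree, this version does not depend on the whitespace between tags. If the headings had duplicate or missing ids, the map key would need to be something other than `h2.id()`.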

huangapple
  • Posted on February 14, 2023, 00:27:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75438640.html