How to get all p tags between and after h2 tags in Jsoup
Question
I have HTML like this:
<h2 id="17273">bla bla bla 1</h2>
<p>Text i need</p>
<p>Text i need</p>
<p>Text i need</p>
<h2 id="45626">bla bla bla 2</h2>
<p>Text i need</p>
<p>Text i need</p>
<p>Text i need</p>
<h2 id="78519">bla bla bla 3</h2>
<p>Text i need</p>
<p>Text i need</p>
<h2 id="72725">bla bla bla 2</h2>
<p>Text i need</p>
<p>Text i need</p>
I want to extract all the p tags that follow each h2 (up to the next h2) and map them to that h2, like this:
[(h2 with id 17273 = all p tags below it), (h2 with id 45626 = all p tags below it)]
To be honest, I don't know how to achieve that. I've tried a few things, such as doc.siblingElements(), but none of them produced this result.
Answer 1
Score: 0
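Since the <h2> and <p> tags are siblings rather than parent and child, nothing in the markup ties a paragraph to its heading. You can use a regex to create that relationship artificially by wrapping each heading and its paragraphs in a custom <parent> element: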
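String x = html // your HTML string
        // close the current wrapper before each following <h2> (expects whitespace between </p> and <h2)
        .replaceAll("</p>\\s+<h2", "</p></parent>\n<h2")
        // open a new wrapper in front of every <h2>
        .replaceAll("<h2", "<parent><h2")
        // close the last wrapper
        + "</parent>";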
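To make the effect of the two replacements concrete, here is a minimal sketch that uses only the standard library and prints the rewritten markup for a shortened sample. The class name `WrapDemo` and the method name `wrapSections` are made up for illustration, and the sketch assumes, as the answer does, that every section ends with a `</p>` followed by whitespace.

```java
public class WrapDemo {

    // Wraps each h2 and the paragraphs after it in a made-up <parent> element,
    // using the same two replaceAll calls as the answer.
    static String wrapSections(String html) {
        return html
                .replaceAll("</p>\\s+<h2", "</p></parent>\n<h2")
                .replaceAll("<h2", "<parent><h2")
                + "</parent>";
    }

    public static void main(String[] args) {
        // Shortened version of the HTML from the question
        String html = "<h2 id=\"17273\">bla bla bla 1</h2>\n"
                + "<p>Text i need</p>\n"
                + "<h2 id=\"45626\">bla bla bla 2</h2>\n"
                + "<p>Text i need</p>";
        System.out.println(wrapSections(html));
    }
}
```

Each heading now sits inside its own `<parent>` element together with its paragraphs, which is what lets the Jsoup code treat a section as one unit.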
Then using Jsoup is relatively simple:
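import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.parse(x);
Elements parents = doc.getElementsByTag("parent");
for (Element e : parents) {
    // attr() reads the id of the first match, which is the h2 of this section
    Elements h2 = e.getElementsByAttribute("id");
    String id = h2.attr("id");
    // collect the text of every paragraph inside this wrapper
    Elements pElements = e.getElementsByTag("p");
    List<String> pList = new ArrayList<>();
    for (Element p : pElements)
        pList.add(p.text());
    System.out.println("h2 with id " + id + " = " + pList);
}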
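The output received (the number at the start of each paragraph is not in the HTML above; the test input presumably numbered the paragraphs per heading):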
h2 with id 17273 = [1 Text i need, 1 Text i need, 1 Text i need]
h2 with id 45626 = [2 Text i need, 2 Text i need, 2 Text i need]
h2 with id 78519 = [3 Text i need, 3 Text i need]
h2 with id 72725 = [4 Text i need, 4 Text i need]
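As an alternative to rewriting the markup with a regex, the same grouping can probably be done by walking the siblings of each h2 until the next h2 is reached. This is a different technique from the one in the answer, shown only as a sketch: it assumes the h2 and p elements share one parent, as in the question, and that jsoup is on the classpath. The class name `HeadingGroups` and the method name `groupParagraphs` are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class HeadingGroups {

    // Maps each h2 id to the text of the p elements that follow it,
    // stopping at the next h2 sibling. Insertion order is preserved.
    static Map<String, List<String>> groupParagraphs(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Element h2 : doc.select("h2")) {
            List<String> paragraphs = new ArrayList<>();
            Element sibling = h2.nextElementSibling();
            // Walk forward until the siblings run out or the next heading starts
            while (sibling != null && !"h2".equals(sibling.tagName())) {
                if ("p".equals(sibling.tagName())) {
                    paragraphs.add(sibling.text());
                }
                sibling = sibling.nextElementSibling();
            }
            groups.put(h2.id(), paragraphs);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Shortened version of the HTML from the question
        String html = "<h2 id=\"17273\">bla bla bla 1</h2>"
                + "<p>Text i need</p><p>Text i need</p>"
                + "<h2 id=\"45626\">bla bla bla 2</h2>"
                + "<p>Text i need</p>";
        groupParagraphs(html).forEach((id, paragraphs) ->
                System.out.println("h2 with id " + id + " = " + paragraphs));
    }
}
```

Because the result is a map keyed by the h2 id, it matches the shape asked for in the question, and it does not depend on how the whitespace between tags happens to be laid out.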