How to get all p tags between and after h2 tags in Jsoup

Question

I have HTML like this:

<h2 id="17273">bla bla bla 1</h2>
<p>Text i need</p>
<p>Text i need</p>
<p>Text i need</p>
<h2 id="45626">bla bla bla 2</h2>
<p>Text i need</p>
<p>Text i need</p>
<p>Text i need</p>
<h2 id="78519">bla bla bla 3</h2>
<p>Text i need</p>
<p>Text i need</p>
<h2 id="72725">bla bla bla 2</h2>
<p>Text i need</p>
<p>Text i need</p>

I want to extract all the p tags that sit between two h2 tags (or after the last h2) and map each group to the h2 tag above it, like this:

[(h2 with id 17273 = all p tags below it), (h2 with id 45626 = all p tags below it)]

To be honest, I don't know how to achieve that. I've tried a few things, such as doc.siblingElements(), but none of them gave me this result.

Answer 1

Score: 0

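The <h2> and <p> elements are siblings, so nothing in the document structure links a heading to the paragraphs below it. One workaround is to use regex to create that link artificially, by wrapping each heading and its paragraphs in a made-up <parent> element: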

    String x = html // your HTML string
        // close the wrapper after the last <p> of each section
        .replaceAll("</p>\\s+<h2", "</p></parent>\n<h2")
        // open a wrapper in front of every <h2>
        .replaceAll("<h2", "<parent><h2")
        // close the wrapper of the final section
        + "</parent>";
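
To make the effect of the two replacements easier to see, here is a minimal sketch that applies them to a shortened two-section input and prints the result. The class name `WrapDemo` and the helper `wrapSections` are hypothetical names chosen for this illustration; the regex calls are the ones from the answer, and like them the sketch assumes there is whitespace between each closing `</p>` and the next `<h2`.

```java
public class WrapDemo {

    // Wraps every <h2> and the <p> elements after it in a made-up <parent> element.
    // Assumes the input starts with an <h2> and has whitespace between </p> and <h2.
    static String wrapSections(String html) {
        return html
                .replaceAll("</p>\\s+<h2", "</p></parent>\n<h2")
                .replaceAll("<h2", "<parent><h2")
                + "</parent>";
    }

    public static void main(String[] args) {
        String html = "<h2 id=\"1\">a</h2>\n<p>x</p>\n<h2 id=\"2\">b</h2>\n<p>y</p>";
        System.out.println(wrapSections(html));
    }
}
```

Running it prints:

    <parent><h2 id="1">a</h2>
    <p>x</p></parent>
    <parent><h2 id="2">b</h2>
    <p>y</p></parent>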

Then using Jsoup is relatively simple:

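    import java.util.ArrayList;
    import java.util.List;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    Document doc = Jsoup.parse(x);
    // each <parent> wrapper holds one <h2> and the <p> elements that followed it
    Elements parents = doc.getElementsByTag("parent");
    for (Element e : parents) {
        // in the sample input, the <h2> is the only element with an id in each wrapper
        Elements h2 = e.getElementsByAttribute("id");
        String id = h2.attr("id");

        // collect the text of every <p> inside this wrapper
        Elements pElements = e.getElementsByTag("p");
        List<String> pList = new ArrayList<>();
        for (Element p : pElements)
            pList.add(p.text());

        System.out.println("h2 with id " + id + " = " + pList);
    }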

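Output (the numeric prefixes suggest the test input had its paragraph text numbered per section; with the HTML exactly as posted in the question, each entry would read "Text i need"):

    h2 with id 17273 = [1 Text i need, 1 Text i need, 1 Text i need]
    h2 with id 45626 = [2 Text i need, 2 Text i need, 2 Text i need]
    h2 with id 78519 = [3 Text i need, 3 Text i need]
    h2 with id 72725 = [4 Text i need, 4 Text i need]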
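
As an alternative to rewriting the markup with regex, the same grouping can probably be built by walking the sibling elements directly. The sketch below is not part of the answer above; it assumes the <h2> and <p> elements are direct siblings, as in the question's HTML, and it uses Jsoup's `select`, `nextElementSibling`, and `tagName`. The class name `H2Sections` and the method `groupByH2` are hypothetical names chosen for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class H2Sections {

    // Maps each h2 id to the text of the <p> siblings that follow that h2,
    // stopping at the next <h2>. Insertion order follows document order.
    static Map<String, List<String>> groupByH2(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, List<String>> sections = new LinkedHashMap<>();
        for (Element h2 : doc.select("h2[id]")) {
            List<String> paragraphs = new ArrayList<>();
            Element next = h2.nextElementSibling();
            while (next != null && !next.tagName().equals("h2")) {
                if (next.tagName().equals("p")) {
                    paragraphs.add(next.text());
                }
                next = next.nextElementSibling();
            }
            sections.put(h2.attr("id"), paragraphs);
        }
        return sections;
    }

    public static void main(String[] args) {
        String html = "<h2 id=\"17273\">bla bla bla 1</h2>"
                + "<p>Text i need</p><p>Text i need</p>"
                + "<h2 id=\"45626\">bla bla bla 2</h2>"
                + "<p>Text i need</p>";
        groupByH2(html).forEach((id, ps) ->
                System.out.println("h2 with id " + id + " = " + ps));
    }
}
```

For the shortened input in `main`, this prints `h2 with id 17273 = [Text i need, Text i need]` followed by `h2 with id 45626 = [Text i need]`. Because it never edits the HTML string, this variant does not depend on whitespace between tags, though it would need adjusting if the headings and paragraphs were nested at different levels.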

huangapple
  • Posted on February 14, 2023, 00:27:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75438640.html