Parse HTML using jsoup - need a specific pattern

Question


I am trying to get the text between tags and save it into variables. For example, in the HTML below I want to save the value "return", which is between the em tags, in one variable. I also need the rest of the text in the p tag, so that:
the em value is: return
the p value is only: an item, cancel an order, print a receipt, track your purchases or reorder items.
If some text comes before the em tag, that text should go into a separate variable as well. Basically, if one p contains multiple tags, it should be split and each piece saved into a different variable. If I knew how to get the text that is not inside the inner tags, I could retrieve the rest.

I have written the code below, but it returns only "return", the text inside the em tags.
Here ep is basically doc.select("p"): I select the p tags and then iterate. I am not sure I am doing this the right way, so any other approaches are highly appreciated.

String text = "<p><em>return </em>an item, cancel an order, print a receipt, track your purchases or reorder items.</p>";

// ep is the result of doc.select("p"), as described above
Elements italic_tags = ep.select("em");
for (Element em : italic_tags) {
    if (em.tagName().equals("em")) {
        // prints only "return"; the text outside the em tag is never reached
        System.out.println(em.select("em").text());
    }
}

Answer 1

Score: 0


If you need every piece of text, both the text enclosed by child tags and the text between them, iterate over Node objects instead of Element objects. I modified your HTML to include more tags so the example is more complete:

import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

String text = "<p><em>return </em>an item, <em>cancel</em> an order, <em>print</em> a receipt, <em>track</em> your purchases or reorder items.</p>";
Document doc = Jsoup.parse(text);

Element ep = doc.selectFirst("p");
List<Node> childNodes = ep.childNodes();
for (Node node : childNodes) {
    if (node instanceof TextNode) {
        // if it's a text node, just display it
        System.out.println(node);
    } else {
        // if it's another element, display its first child, which in
        // this case is a text node (assumes the element is not empty)
        System.out.println(node.childNode(0));
    }
}

output:

return 
an item, 
cancel
 an order, 
print
 a receipt, 
track
 your purchases or reorder items.
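
The loop above prints each piece; to keep the pieces in separate variables, as the question asks, the same walk over the child nodes can fill two lists. The following is only a minimal sketch, assuming jsoup is on the classpath. The class name ParagraphSplitter and the holder class Parts are hypothetical names made up for this illustration and are not part of jsoup. It also shows Element.ownText(), which returns only the text sitting directly inside the p tag, for the simpler case where the pieces do not need to be kept apart.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class ParagraphSplitter {

    /** Hypothetical holder for the two groups of text found in one p element. */
    static final class Parts {
        // text inside child elements such as <em>
        final List<String> tagged = new ArrayList<>();
        // text sitting directly inside <p>, outside any child element
        final List<String> untagged = new ArrayList<>();
    }

    /** Splits the first p element of the given HTML into tagged and untagged text. */
    static Parts split(String html) {
        Parts parts = new Parts();
        Element p = Jsoup.parse(html).selectFirst("p");
        if (p == null) {
            return parts; // no paragraph in the input
        }
        for (Node node : p.childNodes()) {
            if (node instanceof TextNode) {
                String s = ((TextNode) node).text().trim();
                if (!s.isEmpty()) {
                    parts.untagged.add(s);
                }
            } else if (node instanceof Element) {
                // Element.text() also includes the text of nested descendants
                parts.tagged.add(((Element) node).text().trim());
            }
            // other node types (comments, data nodes) are ignored in this sketch
        }
        return parts;
    }

    public static void main(String[] args) {
        String html = "<p><em>return </em>an item, cancel an order, print a receipt, "
                + "track your purchases or reorder items.</p>";
        Parts parts = split(html);
        System.out.println("tagged:   " + parts.tagged);
        System.out.println("untagged: " + parts.untagged);

        // ownText() joins only the text directly inside <p>, skipping <em>
        Element p = Jsoup.parse(html).selectFirst("p");
        System.out.println("ownText:  " + p.ownText());
    }
}
```

Keeping two lists rather than two single variables is a design choice made here so that a paragraph with several em tags, or with text before the first tag, is handled by the same code.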

Posted by huangapple on April 9, 2020 at 10:13:17
When reposting, please keep the link to this article: https://go.coder-hub.com/61112897.html