Getting an element's subtending text nodes (including indirect)
Question
We need to retrieve all the descendant text nodes of an element, whether direct or indirect. The textNodes() method on the Element class returns only the direct child text nodes, not the text nodes nested more deeply (grandchildren, great-grandchildren, and so on).
Given the following sample HTML file:
<html>
  <head/>
  <body>
    <div class="erece mtmhp">
      <a href="http://www.stackoverflow.com">
        <span>Content 1</span>
        <span>Content 2</span>
      </a>
    </div>
  </body>
</html>
I would like to be able to retrieve Content 1 and Content 2, but separately.
Here is my sample code:
import java.io.File;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

public class TextNodeExpt {
    public static void main(String[] args) throws Exception {
        File fileObj = new File(args[0]);
        Document document = Jsoup.parse(fileObj, "UTF-8");
        Elements divs = document.select("div.erece.mtmhp");
        displaySubtendingTextNodes(divs);
    }

    protected static void displaySubtendingTextNodes(Elements divs) {
        for (Element div : divs) {
            Elements anchors = div.select("a");
            for (Element anchor : anchors) {
                // textNodes() only sees the direct children of <a>,
                // so the text inside the nested <span> elements is missed.
                List<TextNode> textNodes = anchor.textNodes();
                System.out.println(textNodes.size() + " text nodes found");
                for (TextNode tn : textNodes) {
                    System.out.println("[" + tn.getWholeText() + "]");
                }
            }
        }
    }
}
Addendum:
Based on the comment from Hovercraft Full Of Eels, I have come up with the following implementation. It is based on this Stack Overflow posting.
I have made use of List<TextNode> in the API, similar to the textNodes() method on the Element class (although the list is passed as a parameter, and not returned from the method).
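// Requires org.jsoup.nodes.Node in addition to the imports above.
protected static void textNodes(Node targetNode, List<TextNode> nodeList) {
    for (Node childNode : targetNode.childNodes()) {
        if (childNode instanceof TextNode && !((TextNode) childNode).isBlank()) {
            // Non-blank text node: collect it.
            nodeList.add((TextNode) childNode);
        } else {
            // Element (or blank text node, which has no children): descend.
            textNodes(childNode, nodeList);
        }
    }
}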
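The recursion above is not specific to jsoup. As a dependency-free sketch of the same idea, the version below uses the JDK's built-in org.w3c.dom API instead of jsoup (a substitution made purely for illustration) and returns the list rather than taking it as a parameter. The class name DescendantTextDemo and the helper names descendantTexts and parse are hypothetical, and the input is assumed to be well-formed XML, which the sample markup happens to be.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;

public class DescendantTextDemo {

    /** Returns the trimmed content of every non-blank text node below root, in document order. */
    public static List<String> descendantTexts(Node root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Node node, List<String> out) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Text) {
                // Keep only text nodes that contain more than whitespace.
                String text = child.getNodeValue().trim();
                if (!text.isEmpty()) {
                    out.add(text);
                }
            } else {
                // Not a text node: descend into its children.
                collect(child, out);
            }
        }
    }

    /** Parses a well-formed XML string with the JDK's DOM parser. */
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<div><a href=\"http://www.stackoverflow.com\">\n"
                + "<span>Content 1</span>\n<span>Content 2</span>\n</a></div>";
        Node anchor = parse(xml).getElementsByTagName("a").item(0);
        System.out.println(descendantTexts(anchor)); // prints [Content 1, Content 2]
    }
}
```

Returning a fresh list keeps the call site to a single expression; passing the list in, as in the addendum, avoids allocating when the caller already has a collection to fill.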
Answer 1
Score: 1
It's not entirely clear what you mean by "retrieve separately". However, if you want to control and customize how the text is extracted from the element, create a separate class that implements the NodeVisitor interface and define the extraction logic there. That makes it easy to handle "Content 1" and "Content 2" as separate strings later, or to put those lines into a collection.
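import java.io.File;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

public class TextNodeVisitorDemo {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.parse(new File(args[0]), "UTF-8");
        Element targetElement = doc.select("div.erece").first();
        if (targetElement == null) {
            System.out.println("No matching element found");
            return;
        }
        TextNodeVisitor textNodeVisitor = new TextNodeVisitor();
        // Node.traverse(NodeVisitor) walks the subtree depth-first. It is used here
        // in place of "new NodeTraversor(visitor)", a constructor that newer jsoup
        // releases no longer provide.
        targetElement.traverse(textNodeVisitor);
        String extractedText = textNodeVisitor.getExtractedText();
        System.out.println(extractedText);
    }

    static class TextNodeVisitor implements NodeVisitor {
        private final StringBuilder extractedText = new StringBuilder();

        @Override
        public void head(Node node, int depth) {
            // Called when a node is first visited; collect non-empty text, one per line.
            if (node instanceof TextNode) {
                String text = ((TextNode) node).text().trim();
                if (!text.isEmpty()) {
                    extractedText.append(text).append("\n");
                }
            }
        }

        @Override
        public void tail(Node node, int depth) {
            // Do nothing on tail
        }

        public String getExtractedText() {
            return extractedText.toString();
        }

        // Each collected text node as its own list entry (List.of requires Java 9+).
        public List<String> getLineList() {
            return List.of(extractedText.toString().split("\n"));
        }
    }
}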