Getting all of an element's descendant text nodes (direct and indirect)

Question

We need to retrieve all the text nodes beneath an element, whether they are direct children or deeper descendants. The textNodes() method on the Element class returns only the direct child text nodes, not the text nodes of grandchildren, great-grandchildren, and so on.

Given the following sample HTML file:

<html>
    <head/>
    <body>
        <div class="erece mtmhp">
            <a href="http://www.stackoverflow.com">
                <span>Content 1</span>
                <span>Content 2</span>
            </a>
        </div>
    </body>
</html>

I would like to be able to retrieve Content 1 and Content 2, but separately.

Here is my sample code:

import java.io.File;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

public class TextNodeExpt {
    public static void main(String[] args) throws Exception {
        // args[0] is the path of the HTML file to parse
        File fileObj = new File(args[0]);
        Document document = Jsoup.parse(fileObj, "UTF-8");

        Elements divs = document.select("div.erece.mtmhp");

        displaySubtendingTextNodes(divs);
    }

    protected static void displaySubtendingTextNodes(Elements divs) {
        for (Element div : divs) {
            Elements anchors = div.select("a");
            for (Element anchor : anchors) {
                // textNodes() yields only the anchor's direct child text nodes;
                // the text inside the nested <span> elements is not included.
                List<TextNode> textNodes = anchor.textNodes();
                System.out.println(textNodes.size() + " text nodes found");
                for (TextNode tn : textNodes) {
                    System.out.println("[" + tn.getWholeText() + "]");
                }
            }
        }
    }
}

Addendum:
Based on the comment from Hovercraft Full Of Eels, I have come up with the following implementation. It is based on this Stack Overflow posting.

The method works with a List<TextNode>, like the textNodes() method on the Element class, although the list is passed in as a parameter rather than returned from the method.

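// Requires org.jsoup.nodes.Node in addition to the imports above.
// Recursively appends every non-blank descendant text node of targetNode to nodeList.
protected static void textNodes(Node targetNode, List<TextNode> nodeList) {
    for (Node childNode : targetNode.childNodes()) {
        if (childNode instanceof TextNode && !((TextNode) childNode).isBlank()) {
            nodeList.add((TextNode) childNode);
        } else {
            // Elements (and blank text nodes, which have no children) are descended into
            textNodes(childNode, nodeList);
        }
    }
}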
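
The recursive helper above fills a list supplied by the caller. As a rough, self-contained sketch of the same recursion that runs on a plain JDK without jsoup on the classpath, the version below uses the JDK's built-in org.w3c.dom parser as a stand-in and returns the list instead. The class name DescendantText and the method collectText are hypothetical names chosen for illustration. With jsoup the shape would be the same, with childNodes(), TextNode and isBlank() taking the place of the DOM calls.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration: JDK DOM stands in for jsoup here.
public class DescendantText {

    // Parses a well-formed XML/XHTML string and returns the trimmed text of
    // every non-blank text node below the root element, in document order.
    public static List<String> collectText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> out = new ArrayList<>();
        collect(doc.getDocumentElement(), out);
        return out;
    }

    // Adds non-blank text children to out and recurses into every other child.
    private static void collect(Node parent, List<String> out) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue().trim();
                if (!text.isEmpty()) {
                    out.add(text);
                }
            } else {
                collect(child, out);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String html = "<div class=\"erece mtmhp\"><a href=\"http://www.stackoverflow.com\">"
                + " <span>Content 1</span> <span>Content 2</span> </a></div>";
        for (String s : collectText(html)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Returning the list keeps the call site to a single expression; passing the list in, as the addendum does, avoids allocating a new list at each level of recursion. Either choice visits the nodes in the same order.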

Answer 1

Score: 1

It's not entirely clear what you mean by "retrieve separately". However, if you want to control and customize how the text is extracted from the element, you could create a separate class that implements the NodeVisitor interface and define the extraction logic there. That makes it easy to handle "Content 1" and "Content 2" as separate strings later, or even to put such lines into a collection.

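import java.io.File;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

// The wrapper class name and the parsing line are added so the fragment compiles.
public class TextNodeVisitorExpt {
    public static void main(String[] args) throws Exception {
        // Parse the file as in the question's code
        Document doc = Jsoup.parse(new File(args[0]), "UTF-8");
        Element targetElement = doc.select("div.erece").first();
        TextNodeVisitor textNodeVisitor = new TextNodeVisitor();
        // Static form; older jsoup releases used new NodeTraversor(visitor).traverse(node)
        NodeTraversor.traverse(textNodeVisitor, targetElement);
        String extractedText = textNodeVisitor.getExtractedText();
        System.out.println(extractedText);
    }

    static class TextNodeVisitor implements NodeVisitor {
        private final StringBuilder extractedText = new StringBuilder();

        // Called when a node is first visited, before its children
        @Override
        public void head(Node node, int depth) {
            if (node instanceof TextNode) {
                String text = ((TextNode) node).text().trim();
                if (!text.isEmpty()) {
                    extractedText.append(text).append("\n");
                }
            }
        }

        @Override
        public void tail(Node node, int depth) {
            // Do nothing on tail
        }

        // All collected texts, one per line
        public String getExtractedText() {
            return extractedText.toString();
        }

        // The collected texts as separate list entries
        public List<String> getLineList() {
            return List.of(extractedText.toString().split("\n"));
        }
    }
}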
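
Since the goal is to handle each text separately, a visitor can also store every text as its own list entry rather than joining the texts with "\n" and splitting them again later. The sketch below illustrates that idea under stated assumptions: it runs on a plain JDK, using org.w3c.dom as a stand-in for jsoup, and the Visitor interface, the traverse method and the class names are hypothetical stand-ins that only mimic the head/tail shape of jsoup's NodeVisitor.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

// Hypothetical illustration: JDK DOM and a hand-written visitor stand in for jsoup.
public class ListingVisitorSketch {

    // Stand-in for a head/tail visitor: head() on entering a node, tail() on leaving it.
    interface Visitor {
        void head(Node node, int depth);

        default void tail(Node node, int depth) {
            // nothing to do by default
        }
    }

    // Depth-first walk: head() before a node's children, tail() after them.
    static void traverse(Visitor visitor, Node node, int depth) {
        visitor.head(node, depth);
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            traverse(visitor, c, depth + 1);
        }
        visitor.tail(node, depth);
    }

    // Keeps each non-blank text node as its own list entry.
    static class TextCollector implements Visitor {
        private final List<String> lines = new ArrayList<>();

        @Override
        public void head(Node node, int depth) {
            if (node.getNodeType() == Node.TEXT_NODE) {
                String text = node.getNodeValue().trim();
                if (!text.isEmpty()) {
                    lines.add(text);
                }
            }
        }

        List<String> getLines() {
            return Collections.unmodifiableList(lines);
        }
    }

    // Parses a well-formed XML/XHTML string and returns its non-blank texts in document order.
    public static List<String> extract(String xml) throws Exception {
        Node root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        TextCollector collector = new TextCollector();
        traverse(collector, root, 0);
        return collector.getLines();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract("<a><span>Content 1</span> <span>Content 2</span></a>"));
    }
}
```

Holding a list inside the visitor means callers can index or iterate the texts directly, and an element with no text yields an empty list rather than a list containing one empty string.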

Posted by huangapple on August 11, 2023, 02:01:36
Link to this article: https://go.coder-hub.com/76878261.html