August 11, 2023 02:01:36

Getting an element's subtending text nodes (including indirect)

Question

We need to retrieve all the subtending text nodes of an element, whether direct or indirect. The textNodes() method on the Element class returns only the direct child text nodes, not the grandchild text nodes, great-grandchild text nodes, and so on.

Given the following sample HTML file:

<html>
    <head/>
    <body>
        <div class="erece mtmhp">
            <a href="http://www.stackoverflow.com">
                <span>Content 1</span>
                <span>Content 2</span>
            </a>
        </div>
    </body>
</html>

I would like to be able to retrieve Content 1 and Content 2, but separately.

Here is my sample code:

import java.io.File;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

public class TextNodeExpt {
    public static void main(String[] args) throws Exception {

        File fileObj = new File(args[0]);
        Document document = Jsoup.parse(fileObj, "UTF-8");

        Elements divs = document.select("div.erece.mtmhp");

        displaySubtendingTextNodes(divs);
    }

    protected static void displaySubtendingTextNodes(Elements divs) {
        for (Element div : divs) {
            Elements anchors = div.select("a");
            for (Element anchor : anchors) {
                List<TextNode> textNodes = anchor.textNodes();
                System.out.println(textNodes.size() + " text nodes found");
                for (TextNode tn : textNodes) {
                    System.out.println("[" + tn.getWholeText() + "]");
                }
            }
        }
    }
}

Addendum:
Based on the comment from Hovercraft Full Of Eels, I have come up with the following implementation. It is based on this Stackoverflow posting.

I have made use of List<TextNode> in the API, similar to the textNodes() method on the Element class (although the list is passed as a parameter, and not returned from the method).

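// Requires an additional import: org.jsoup.nodes.Node
// Recursively collects every non-blank TextNode beneath targetNode into nodeList.
protected static void textNodes(Node targetNode, List<TextNode> nodeList) {
    for (Node childNode : targetNode.childNodes()) {
        if (childNode instanceof TextNode && !((TextNode) childNode).isBlank()) {
            nodeList.add((TextNode) childNode);
        } else {
            // Blank text nodes also land here; they have no children, so the call is a no-op.
            textNodes(childNode, nodeList);
        }
    }
}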
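
For anyone who wants to try the recursive idea from the addendum without putting jsoup on the classpath, here is a minimal sketch of the same traversal using the JDK's built-in org.w3c.dom parser as a stand-in. This is not the jsoup API: the class name SubtendingText and the methods collectText and demo are made up for illustration, and it only works here because the sample markup happens to be well-formed XML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class SubtendingText {
    // Hypothetical helper: returns the trimmed content of every non-blank
    // text node beneath root, at any depth, in document order.
    static List<String> collectText(Node root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Node node, List<String> out) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue().trim();
                if (!text.isEmpty()) {
                    out.add(text);
                }
            } else {
                // Not a text node: descend into its children.
                collect(child, out);
            }
        }
    }

    // Parses the sample markup from the question and collects the text under the first <a>.
    static List<String> demo() {
        String xml = "<html><head/><body><div class=\"erece mtmhp\">"
                + "<a href=\"http://www.stackoverflow.com\">\n"
                + "  <span>Content 1</span>\n"
                + "  <span>Content 2</span>\n"
                + "</a></div></body></html>";
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return collectText(doc.getElementsByTagName("a").item(0));
        } catch (Exception e) {
            throw new IllegalStateException("sample markup failed to parse", e);
        }
    }

    public static void main(String[] args) {
        for (String s : demo()) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Returning a fresh list from collectText, rather than passing one in as in the addendum, is only a matter of taste; the recursion itself has the same shape.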

Answer 1

Score: 1

It's not entirely clear what you mean by "retrieve separately". However, if you want to control and customize how the text is extracted from the element, you can create a separate class that implements the NodeVisitor interface and define the extraction logic there. That makes it straightforward to handle "Content 1" and "Content 2" separately later on, or even to put those lines into a collection.

// Fragment: place inside a method where `doc` is the parsed Document.
// Needs imports: java.util.List, org.jsoup.nodes.Element, org.jsoup.nodes.Node,
// org.jsoup.select.NodeTraversor, org.jsoup.select.NodeVisitor.
Element targetElement = doc.select("div.erece").first();
TextNodeVisitor textNodeVisitor = new TextNodeVisitor();
// Recent jsoup versions expose traversal as a static method rather than via a constructor.
NodeTraversor.traverse(textNodeVisitor, targetElement);
String extractedText = textNodeVisitor.getExtractedText();
System.out.println(extractedText);

static class TextNodeVisitor implements NodeVisitor {
    private final StringBuilder extractedText = new StringBuilder();

    @Override
    public void head(Node node, int depth) {
        // Append the trimmed text of each non-blank text node, one per line.
        if (node instanceof org.jsoup.nodes.TextNode) {
            org.jsoup.nodes.TextNode textNode = (org.jsoup.nodes.TextNode) node;
            String text = textNode.text().trim();
            if (!text.isEmpty()) {
                extractedText.append(text).append("\n");
            }
        }
    }

    @Override
    public void tail(Node node, int depth) {
        // Do nothing on tail
    }

    public String getExtractedText() {
        return extractedText.toString();
    }

    public List<String> getLineList() {
        return List.of(extractedText.toString().split("\n"));
    }
}
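
If the goal is to work with each piece of text on its own, one option is to have the visitor accumulate into a list instead of a StringBuilder. The sketch below shows only that accumulation logic, kept free of jsoup so it runs standalone. TextCollector and its methods are hypothetical names, not part of any library; in a visitor like the one above, head() would call add(textNode.text()) for each text node it encounters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TextCollector {
    private final List<String> lines = new ArrayList<>();

    // Intended to be called once per text node; blank input is ignored.
    public void add(String rawText) {
        String text = rawText.trim();
        if (!text.isEmpty()) {
            lines.add(text);
        }
    }

    // Read-only view of the collected entries, in the order they were added.
    public List<String> getLines() {
        return Collections.unmodifiableList(lines);
    }

    // Convenience for callers that still want a single newline-separated string.
    public String getJoined() {
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        TextCollector collector = new TextCollector();
        // Simulated text-node contents, including whitespace-only nodes between elements.
        for (String raw : new String[] {"\n    ", "Content 1", "\n    ", "Content 2", "\n"}) {
            collector.add(raw);
        }
        for (String line : collector.getLines()) {
            System.out.println("[" + line + "]");
        }
    }
}
```

Keeping a list avoids the round trip through split("\n"), which returns a one-element array containing an empty string when nothing was collected.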
