Jsoup - <noscript> content in <head> gets interpreted as text
Question
I ran into a problem: parsing the following HTML with jsoup produces an unwanted result.
The HTML:
<html>
<head>
<title>Try jsoup</title>
<noscript><p>thisisatest</p></noscript>
<noscript><img id="tracking-test-noscript" style="width: 1px; height: 1px" src="http://fullwithsheep/img/tracking3.jpg"></noscript>
</head>
<body>
<noscript><p>thisisatest</p></noscript>
<p>This is <a href="http://jsoup.org/">jsoup</a>.</p>
<noscript><img id="tracking-test-noscript" style="width: 1px; height: 1px" src="http://fullwithsheep/img/tracking3.jpg"></noscript>
</body>
</html>
jsoup's interpretation of the document:
<html>
<head>
<title>Try jsoup</title>
<noscript>&lt;p&gt;thisisatest</noscript>
<noscript>&lt;img id="tracking-test-noscript" style="width: 1px; height: 1px" src="http://fullwithsheep/img/tracking3.jpg"&gt;</noscript>
</head>
<body>
<noscript><p>thisisatest</p></noscript>
<p>This is <a href="http://jsoup.org/">jsoup</a>.</p>
<noscript><img id="tracking-test-noscript" style="width: 1px; height: 1px" src="http://fullwithsheep/img/tracking3.jpg"></noscript>
</body></html>
As you can see, the inner HTML of the noscript tags inside the head node was interpreted as text. What I want is for jsoup to still interpret it as HTML rather than text (without escaping < into &lt; and so on).
As a workaround, I select all noscript tags after Jsoup.parse has run and try to convert the text of each noscript tag back into HTML. However, this does not feel like the right way to do it. Is this a bug in the jsoup library, or is this behaviour intended?
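For reference, the workaround described above could look roughly like the sketch below. It assumes the noscript content in head ends up as an escaped text node, as in the output shown; the class and method names are made up for illustration and are not part of jsoup. Note that the closing </p> is already gone in the parsed output above, so the reconstruction relies on jsoup closing the element again when the text is re-parsed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class NoscriptWorkaround {

    // Hypothetical helper (not a jsoup API): parse with the default HTML parser,
    // then re-parse the text of each <noscript> in <head> as an HTML fragment.
    static Document parseAndRestoreHeadNoscript(String html) {
        Document doc = Jsoup.parse(html);
        for (Element noscript : doc.select("head noscript")) {
            // Skip elements that already contain child elements.
            if (!noscript.children().isEmpty()) {
                continue;
            }
            String raw = noscript.text();
            // Only rewrite when the text looks like markup.
            if (raw.contains("<")) {
                noscript.html(raw); // replaces the inner content with the parsed fragment
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Try jsoup</title>"
                + "<noscript><p>thisisatest</p></noscript></head>"
                + "<body><p>This is <a href=\"http://jsoup.org/\">jsoup</a>.</p></body></html>";
        Document doc = parseAndRestoreHeadNoscript(html);
        System.out.println(doc.head().html());
    }
}
```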
Answer 1
Score: 0
Use xmlParser to avoid undesired HTML modifications:
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
The default parser "treats input as HTML5, and enforces the creation of a normalised document, based on a knowledge of the semantics of the incoming tags", while xmlParser "assumes no knowledge of the incoming tags and does not treat it as HTML, rather creates a simple tree directly from the input", and that's what you need.
Quotes come from the documentation: https://jsoup.org/apidocs/org/jsoup/parser/Parser.html#xmlParser()
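A minimal, self-contained sketch of this suggestion, using a shortened version of the HTML from the question. The class and method names are only for illustration. Keep in mind that the XML parser applies no HTML rules at all, so an unclosed tag such as <img> is not treated as a void element and the tree is not normalised.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

final class XmlParserNoscriptDemo {

    // Parse the markup with jsoup's XML parser instead of the default HTML parser.
    static Document parseAsXml(String html) {
        return Jsoup.parse(html, "", Parser.xmlParser());
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Try jsoup</title>"
                + "<noscript><p>thisisatest</p></noscript></head>"
                + "<body><p>This is <a href=\"http://jsoup.org/\">jsoup</a>.</p></body></html>";
        Document doc = parseAsXml(html);

        // With the XML parser, the <p> inside <head><noscript> is a real element.
        Element p = doc.select("head > noscript > p").first();
        System.out.println(p != null ? p.text() : "no <p> element found");
    }
}
```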