Jsoup - <noscript> content in <head> gets interpreted as text

Question

I ran into a problem: parsing the following HTML with jsoup produces an unwanted result.

The HTML

<html>
<head>
<title>Try jsoup</title>
<noscript><p>thisisatest</p></noscript>
<noscript><img id="tracking-test-noscript" style="width: 1px; height: 1px" src="http://fullwithsheep/img/tracking3.jpg"></noscript>
</head>
<body>
<noscript><p>thisisatest</p></noscript>
<p>This is <a href="http://jsoup.org/">jsoup</a>.</p>
<noscript><img id="tracking-test-noscript" style="width: 1px; height: 1px" src="http://fullwithsheep/img/tracking3.jpg"></noscript>
</body>
</html>

jsoup's interpretation of the document

<html>
<head>
<title>Try jsoup</title>
<noscript>&lt;p&gt;thisisatest</noscript>
<noscript>&lt;img  id="tracking-test-noscript" style="width: 1px; height: 1px" src="http://fullwithsheep/img/tracking3.jpg"&gt;</noscript>
</head>
<body>
<noscript><p>thisisatest</p></noscript>
<p>This is <a href="http://jsoup.org/">jsoup</a>.</p>
<noscript><img id="tracking-test-noscript" style="width: 1px; height: 1px" src="http://fullwithsheep/img/tracking3.jpg"></noscript>
</body></html>

As you can see, the inner HTML of the noscript tags inside the head node was interpreted as text. What I want is for jsoup to still interpret it as HTML instead of text (without escaping < into &lt; and so on).

As a workaround, I select all noscript tags after Jsoup.parse has run and try to convert the text of each noscript tag back into HTML. However, this does not feel like the right way to do it. Is this a bug in the jsoup library, or is this behaviour intended?

Answer 1

Score: 0

Use xmlParser to avoid undesired HTML modifications:

Document doc = Jsoup.parse(html, "", Parser.xmlParser());

The default parser "treats input as HTML5, and enforces the creation of a normalised document, based on a knowledge of the semantics of the incoming tags",
while xmlParser "assumes no knowledge of the incoming tags and does not treat it as HTML, rather creates a simple tree directly from the input", which is what you need.

Quotes come from the documentation: https://jsoup.org/apidocs/org/jsoup/parser/Parser.html#xmlParser()
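
To make the suggestion above concrete, here is a minimal sketch of a complete program. The class name NoscriptXmlParserDemo, the helper parseAsXml, and the shortened input string are made up for illustration; only Jsoup.parse, Parser.xmlParser, selectFirst, and text are jsoup calls. It assumes jsoup (1.11.1 or later, for selectFirst) is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class NoscriptXmlParserDemo {

    // Shortened version of the HTML from the question (illustrative only).
    static final String HTML =
        "<html><head><title>Try jsoup</title>"
        + "<noscript><p>thisisatest</p></noscript>"
        + "</head><body>"
        + "<p>This is <a href=\"http://jsoup.org/\">jsoup</a>.</p>"
        + "</body></html>";

    // Hypothetical helper: parse with the XML parser so that jsoup builds
    // the tree directly from the input instead of applying HTML5 rules.
    static Document parseAsXml(String html) {
        return Jsoup.parse(html, "", Parser.xmlParser());
    }

    public static void main(String[] args) {
        Document doc = parseAsXml(HTML);

        // If the <p> inside the head's <noscript> was kept as an element
        // (rather than as escaped text), this selector finds it.
        Element p = doc.selectFirst("head > noscript > p");
        System.out.println(p != null ? p.text() : "no <p> element found");
    }
}
```

One trade-off to keep in mind: because the XML parser applies no HTML rules, it also does not know that tags such as <img> have no closing tag, so on real-world HTML the resulting tree may differ from what a browser would build.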

huangapple
  • Posted on August 18, 2020, 15:27:33
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63463760.html