Golang html.Parse rewriting href query strings to contain &amp;

Question

I have the following code:

package main

import (
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	myHtmlDocument := `<!DOCTYPE html>
<html>
<head>
</head>
<body>
    <a href="http://www.example.com/input?foo=bar&baz=quux">WTF</a>
</body>
</html>`

	doc, _ := html.Parse(strings.NewReader(myHtmlDocument))
	html.Render(os.Stdout, doc)
}

The html.Render function is producing the following output:

<!DOCTYPE html><html><head>

</head>
<body>
    <a href="http://www.example.com/input?foo=bar&amp;baz=quux">WTF</a>

</body></html>

Why is it rewriting the query string and converting & to &amp; (in-between bar and baz)?

Is there a way to avoid this behavior?

I'm trying to do template transformation, and I don't want it mangling my URLs.

Answer 1

Score: 2

The html package wants to generate valid HTML, and the XHTML 1.0 guidelines state that an ampersand in an href attribute must be encoded:

https://www.w3.org/TR/xhtml1/guidelines.html#C_12

> In both SGML and XML, the ampersand character ("&") declares the beginning of an entity reference (e.g., &reg; for the registered trademark symbol "®"). Unfortunately, many HTML user agents have silently ignored incorrect usage of the ampersand character in HTML documents - treating ampersands that do not look like entity references as literal ampersands. XML-based user agents will not tolerate this incorrect usage, and any document that uses an ampersand incorrectly will not be "valid", and consequently will not conform to this specification. In order to ensure that documents are compatible with historical HTML user agents and XML-based user agents, ampersands used in a document that are to be treated as literal characters must be expressed themselves as an entity reference (e.g. "&amp;").
> For example, when the href attribute of the a element refers to a CGI script that takes parameters, it must be expressed as http://my.site.dom/cgi-bin/myscript.pl?class=guest&amp;name=user rather than as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user.

In this case, Go is actually making your HTML better and valid.

That said, browsers will unescape it, so the resulting URL, if the link were clicked, would still be the correct one (without the &amp;, just the &):

<!-- begin snippet: js hide: false console: true babel: false -->

<!-- language: lang-js -->

console.log(document.querySelector('a').href)

<!-- language: lang-html -->

<a href="http://www.example.com/input?foo=bar&amp;baz=quux">WTF</a>

<!-- end snippet -->

EDIT: Since people are being pedantic in the comments, I'll note that in HTML5 you are no longer required to escape the ampersand; however, it is still always valid to escape it. On the other hand, there are still situations in which it is invalid not to: essentially any time the ampersand is followed by alphanumeric characters and a semicolon that do not form a named character reference:

> An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is followed by one or more ASCII alphanumerics, followed by a U+003B SEMICOLON character (;), where these characters do not match any of the names given in the named character references section.

which means that a link like:

<a href="http://www.example.com/input?foo=bar&a;&baz=quux">WTF</a>

would be invalid, yet if it were

<a href="http://www.example.com/input?foo=bar&amp;a;&baz=quux">WTF</a>

it would be valid.
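
A rough way to spot such cases in Go is sketched below. It is an approximation under assumptions, not a validator: it flags every `&name;` sequence that the standard library's html.UnescapeString leaves untouched, and it can miss edge cases where a legacy name such as `amp` is matched as a prefix of a longer name. The function name ambiguousAmpersands is made up for this example.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

// candidateRef matches "&", one or more ASCII alphanumerics, then ";".
var candidateRef = regexp.MustCompile(`&[0-9A-Za-z]+;`)

// ambiguousAmpersands (hypothetical helper) returns the candidates in s that
// html.UnescapeString does not change, i.e. that it does not recognise as
// named character references.
func ambiguousAmpersands(s string) []string {
	var out []string
	for _, m := range candidateRef.FindAllString(s, -1) {
		if html.UnescapeString(m) == m {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	// "&a;" is not a named character reference, so it is reported.
	fmt.Println(ambiguousAmpersands("http://www.example.com/input?foo=bar&a;&baz=quux")) // prints [&a;]

	// With the first ampersand escaped, nothing is reported.
	fmt.Println(ambiguousAmpersands("http://www.example.com/input?foo=bar&amp;a;&baz=quux")) // prints []
}
```

Note that `&baz=quux` is never flagged in either link, because the name is not followed by a semicolon.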

So the parser sticks to a rule that is simpler to implement, and works in all versions of HTML, to make your HTML better and still valid.

huangapple
  • Posted on 2022-12-14 06:31:55
  • When reposting, please keep this link: https://go.coder-hub.com/74791585.html