How to choose URL encoding standard in Go?

Question

I have a Go client that is communicating with a server that follows RFC 1738 URL encoding rules. RFC 1738 has since been updated (replaced) by RFC 3986, which is what Go seems to be using, at least in v1.17.7.

s := "blue+~light blue"
s = url.QueryEscape(s)
fmt.Println(s) // blue%2B~light+blue

In RFC 1738, ~ is a reserved ("unsafe") character and should be encoded as %7E, whereas in RFC 3986 it's not necessary to encode ~. This is just one difference between the two RFCs. There are likely others that I've not looked into yet, which is why I don't want to go down the path of naively replacing ~ with %7E.

Can I make Go create an "RFC 1738 compatible" encoded URL? If not, are there third-party libraries that can do this, perhaps by accepting an RFC number parameter? time already does this:

t.Format(time.RFC822)
t.Format(time.RFC850)
t.Format(time.RFC1123)
t.Format(time.RFC3339)

Answer 1

Score: 1

> In RFC 1738, ~ is a reserved ("unsafe") character and should be encoded as %7E

~ is not reserved. It has no special meaning in a URI.

The reserved characters in 1738 are: ";" | "/" | "?" | ":" | "@" | "&" | "=". The reserved characters in 3986 are: ":" / "/" / "?" / "#" / "[" / "]" / "@" / "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=". The 3986 reserved set contains all the characters of the 1738 reserved set. It is a superset.

Unsafe is different, and RFC 3986 got rid of unsafe for good reason.

RFC 1738 makes characters "unsafe" because they may have special meaning to other encodings.

> * The space character is unsafe because significant spaces may disappear and insignificant spaces may be introduced when URLs are transcribed or typeset or subjected to the treatment of word-processing programs.
> * The characters "<" and ">" are unsafe because they are used as the delimiters around URLs in free text.
> * The quote mark (""") is used to delimit URLs in some systems.
> * The character "#" is unsafe and should always be encoded because it is used in World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it.
> * The character "%" is unsafe because it is used for encodings of other characters.
> * Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".

That might have made sense in 1994, when things were much more lax and URLs were expected to be embedded freely in text. Here in 2022, the concern that "gateways and other transport agents are known to sometimes modify such characters" has long since become obsolete.

Nowadays it's well established that it's the responsibility of the thing using the text to do its own escaping, so RFC 3986 got rid of unsafe characters. It's not the RFC's job to guess what other encodings might use as special characters. The thing consuming your URI has the responsibility to escape and encode it according to its rules. If it doesn't, that's a bug and possibly a security problem on its side.


Since ~ is not reserved, even pre-3986 code (RFC 3986 was published 17 years ago) will read both %7E and ~ in a URL as ~.

If ~ has special meaning to the server and it doesn't do its own escaping, it's likely very broken and insecure in many other ways. It will probably also choke on UTF-8.
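The statement that %7E and ~ are read as the same character can be illustrated with Go's own parser. This is only a sketch: it shows how `url.Parse` behaves, which is not proof of how any particular server decodes, and the `example.com` URLs and the `decodedPath` helper are made up for the demonstration.

```go
package main

import (
	"fmt"
	"net/url"
)

// decodedPath parses rawURL and returns its percent-decoded path.
func decodedPath(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	return u.Path, nil
}

func main() {
	// Hypothetical URLs that differ only in how ~ is written.
	for _, raw := range []string{
		"http://example.com/%7Euser",
		"http://example.com/~user",
	} {
		p, err := decodedPath(raw)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", raw, p)
	}
}
```

Both inputs decode to the path `/~user`, so a decoder of this kind cannot tell the two spellings apart.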

Answer 2

Score: 1

Go does not provide any knobs for url.QueryEscape. It is easy enough to whip up a custom escaper for your scenario.

Start by declaring a table of the bytes that should be left as is in the result:

// noEscape[b] is true if b is in the intersection of the allowed
// bytes in RFC 1738 and HTML5 form values.  Note that RFC 1738
// removes one byte allowed by HTML 5 -- '~'.
var noEscape = [256]bool{
	'A': true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
	'a': true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
	'0': true, true, true, true, true, true, true, true, true, true,
	'-': true,
	'_': true,
	'.': true,
}

Here's the function:

// queryEscape1738 escapes the string so it can be safely
// placed inside a query.
func queryEscape1738(s string) string {
	percent := 0  // number of bytes to % encode
	plus := false // do we need to + encode space?
	for i := 0; i < len(s); i++ {
		b := s[i]
		if b == ' ' {
			plus = true
		} else if !noEscape[b] {
			percent++
		}
	}

	// Nothing to do?
	if percent == 0 && !plus {
		return s
	}

	// Encode!
	p := make([]byte, 0, len(s)+2*percent)
	for i := 0; i < len(s); i++ {
		b := s[i]
		if b == ' ' {
			p = append(p, '+')
		} else if noEscape[b] {
			p = append(p, b)
		} else {
			p = append(p, '%', "0123456789ABCDEF"[b>>4], "0123456789ABCDEF"[b&15])
		}
	}
	return string(p)
}
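To see how the noEscape table relates to the standard library, one can list the ASCII bytes that `url.QueryEscape` leaves untouched. This is a small sketch for comparison only; the `unescapedBytes` helper is an illustrative name, and the check is limited to single ASCII bytes.

```go
package main

import (
	"fmt"
	"net/url"
)

// unescapedBytes returns, in ASCII order, every single ASCII byte
// that url.QueryEscape returns unchanged.
func unescapedBytes() string {
	var out []byte
	for b := 0; b < 128; b++ {
		s := string([]byte{byte(b)})
		if url.QueryEscape(s) == s {
			out = append(out, byte(b))
		}
	}
	return string(out)
}

func main() {
	fmt.Println(unescapedBytes())
}
```

The result is the letters, the digits, and `-`, `.`, `_`, `~`. Compared with the table above, the only extra byte the standard library leaves alone is `~`, which is the byte the custom escaper is meant to encode.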

All that said, it's unlikely that the server cares whether ~ is encoded or not. The typical decoder converts + to space, %xx to the decoded hex value and all other byte values are used as is.
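Go's own form decoder behaves like the typical decoder described above, so it can stand in for one in a quick check. This is a sketch under that assumption: the real server's decoder is unknown, and the `color` parameter name and `decodeColor` helper are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// decodeColor decodes a raw query string and returns the value of
// its "color" parameter (a made-up parameter name for this example).
func decodeColor(rawQuery string) (string, error) {
	values, err := url.ParseQuery(rawQuery)
	if err != nil {
		return "", err
	}
	return values.Get("color"), nil
}

func main() {
	// The same value encoded with ~ escaped and with ~ left as is.
	for _, q := range []string{
		"color=blue%2B%7Elight+blue",
		"color=blue%2B~light+blue",
	} {
		v, err := decodeColor(q)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Printf("%q -> %q\n", q, v)
	}
}
```

Both query strings decode to `blue+~light blue`. If the server uses a decoder of this kind, either encoding of ~ is accepted.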

huangapple
  • Posted on February 15, 2022, 07:48:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/71119582.html