Completely unescape & percent decode URL

Question

I am working on an RSS news parser. The content can contain very different URLs: hrefs that are HTML-escaped or not, and URL-encoded or not:

URL-encoded:

https://www.lefigaro.fr/flash-eco/la-russie-a-gagne-93-0220613#:~:text=La%20Russie%20a%20engrang%C3%A9%2093,qui%20%C3%A9pingle%20particuli%C3%A8rement%20la%20France

Escaped:

http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&amp;mid=2658568&amp;idx=1&amp;sn=b50084652c901&amp;chksm=f0cb0fabcee7d4&amp;scene=21#wechat_redirect

Not encoded & not escaped:

https://newsquawk.com/daily/article?id=2490-us-market-open-concerns&utm_source=tradingview&utm_medium=research&utm_campaign=partner-post

Additionally, the raw RSS content may contain unencoded unsafe characters:

https://www.unsafe.com/a<b>c{d}e[f ]\g^

I need to make all the URLs formally "safe". It seems the only way to get a formally safe URL is to completely unescape and decode it first?

Can I somehow normalize all these different URLs? Is there a way to get a completely unescaped and decoded URL in Go?

func decode(url string) (completelyDecodedUrl string, err error) {
    // ??
}

Answer 1

Score: 1

The URL-encoded example is fine as-is; that is how data is transmitted as part of a URL. If you need the decoded version, parse the URL and read its URL.Fragment field.

As for the second (HTML-escaped) example, simply use html.UnescapeString().

For example:

s := "https://www.lefigaro.fr/flash-eco/la-russie-a-gagne-93-0220613#:~:text=La%20Russie%20a%20engrang%C3%A9%2093,qui%20%C3%A9pingle%20particuli%C3%A8rement%20la%20France"
u, err := url.Parse(s)
if err != nil {
	panic(err)
}

fmt.Println(u.Fragment)

s2 := "http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&amp;mid=2658568&amp;idx=1&amp;sn=b50084652c901&amp;chksm=f0cb0fabcee7d4&amp;scene=21#wechat_redirect"
fmt.Println(html.UnescapeString(s2))

This will output:

:~:text=La Russie a engrangé 93,qui épingle particulièrement la France
http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&mid=2658568&idx=1&sn=b50084652c901&chksm=f0cb0fabcee7d4&scene=21#wechat_redirect

You do not need to decode the link, because the encoded form is the valid one. You must use the encoded form; the receiving server is the one that decodes it.
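As a quick check of this point, here is a minimal sketch that parses the URL-encoded example with url.Parse and serializes it again with String(). The helper name reencode is an assumption made for illustration, not part of any library. The serialized form should match the input, and the decoded text is still available through the Fragment field when you need it for display.

```go
package main

import (
	"fmt"
	"net/url"
)

// reencode is a hypothetical helper: it parses rawURL and returns the
// serialized (encoded) form together with the decoded fragment.
func reencode(rawURL string) (encoded, fragment string, err error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", "", err
	}
	return u.String(), u.Fragment, nil
}

func main() {
	s := "https://www.lefigaro.fr/flash-eco/la-russie-a-gagne-93-0220613#:~:text=La%20Russie%20a%20engrang%C3%A9%2093,qui%20%C3%A9pingle%20particuli%C3%A8rement%20la%20France"
	encoded, fragment, err := reencode(s)
	if err != nil {
		panic(err)
	}
	fmt.Println(encoded == s) // reports whether the encoded form survived the round trip
	fmt.Println(fragment)     // decoded text, useful for display only
}
```

Keep the encoded string for storing and requesting the link, and use the decoded fragment only where a human-readable form is wanted.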

To detect whether the URL is HTML-escaped, you may check whether it contains the semicolon character ; since it is reserved in URLs (see RFC 1738) and HTML escape sequences contain it. This is a heuristic rather than a proof, because a valid URL may legitimately contain a semicolon. So decode() may look like this:

func decode(s string) string {
	if strings.IndexByte(s, ';') >= 0 {
		s = html.UnescapeString(s)
	}
	return s
}

If you're worried about malicious or invalid URLs, you may parse and re-encode the URL:

func decode(s string) (string, bool) {
	if strings.IndexByte(s, ';') >= 0 {
		s = html.UnescapeString(s)
	}
	u, err := url.ParseRequestURI(s)
	if err != nil {
		return "", false
	}
	return u.String(), true
}

Testing it:

fmt.Println(decode(`http//foo.bar`))
fmt.Println(decode(`http://foo.bar/doc?query=abc#first`))
fmt.Println(decode(`https://www.unsafe.com/a<b>c{d}e[f ]\g^`))

This will output (the first line is the empty string followed by false):

 false
http://foo.bar/doc?query=abc#first true
https://www.unsafe.com/a%3Cb%3Ec%7Bd%7De%5Bf%20%5D%5Cg%5E true
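One caveat: url.ParseRequestURI is documented as assuming the input has no #fragment suffix, so for feed links that carry fragments a variant built on url.Parse may be a better fit. The sketch below is one possible take, not the code from the answer above. The name normalize, the explicit scheme and host check, and the example.com input are assumptions added for illustration.

```go
package main

import (
	"fmt"
	"html"
	"net/url"
	"strings"
)

// normalize is a hypothetical variant of decode(): it HTML-unescapes the input
// when a semicolon hints at entities, parses it with url.Parse so that the
// fragment is handled separately, and returns the re-encoded form.
func normalize(s string) (string, bool) {
	if strings.IndexByte(s, ';') >= 0 {
		s = html.UnescapeString(s)
	}
	u, err := url.Parse(s)
	// url.Parse also accepts relative references such as "http//foo.bar",
	// so require an explicit scheme and host (an assumption about what
	// counts as a usable link in a feed).
	if err != nil || u.Scheme == "" || u.Host == "" {
		return "", false
	}
	return u.String(), true
}

func main() {
	inputs := []string{
		`http//foo.bar`,
		`http://example.com/s?a=1&amp;b=2#top`,
		`https://www.unsafe.com/a<b>c{d}e[f ]\g^`,
	}
	for _, in := range inputs {
		out, ok := normalize(in)
		fmt.Printf("%q %v\n", out, ok)
	}
}
```

Whether to reject relative references is a design choice: a feed parser that resolves links against a base URL could instead call ResolveReference on the parsed base before serializing.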

huangapple
  • Published on June 13, 2022 at 23:52:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/72605675.html