Completely unescape & percent decode URL
Question
I am working on an RSS news parser. The hrefs I find in the content vary widely: they may be HTML-escaped or not, and URL-encoded or not:
URL-encoded:
https://www.lefigaro.fr/flash-eco/la-russie-a-gagne-93-0220613#:~:text=La%20Russie%20a%20engrang%C3%A9%2093,qui%20%C3%A9pingle%20particuli%C3%A8rement%20la%20France
HTML-escaped:
http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&amp;mid=2658568&amp;idx=1&amp;sn=b50084652c901&amp;chksm=f0cb0fabcee7d4&amp;scene=21#wechat_redirect
Neither encoded nor escaped:
https://newsquawk.com/daily/article?id=2490-us-market-open-concerns&utm_source=tradingview&utm_medium=research&utm_campaign=partner-post
Additionally, the RSS content may initially contain unencoded unsafe characters:
https://www.unsafe.com/a<b>c{d}e[f ]\g^
I need to make all the URLs formally "safe". It seems the only way to get a formally safe URL is to completely unescape and decode it first.
Can I somehow normalize all these different URLs? Is there a way to get a completely unescaped and decoded URL in Go?
func decode(url string) (completelyDecodedUrl string, err error) {
	// ??
}
Answer 1
Score: 1
The URL-encoded example is fine as is; that is how data is transmitted as part of a URL. If you need the decoded version, parse the URL and print its URL.Fragment field.
As for the second, simply use html.UnescapeString().
For example:
s := "https://www.lefigaro.fr/flash-eco/la-russie-a-gagne-93-0220613#:~:text=La%20Russie%20a%20engrang%C3%A9%2093,qui%20%C3%A9pingle%20particuli%C3%A8rement%20la%20France"
u, err := url.Parse(s)
if err != nil {
	panic(err)
}
fmt.Println(u.Fragment)
s2 := "http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&amp;mid=2658568&amp;idx=1&amp;sn=b50084652c901&amp;chksm=f0cb0fabcee7d4&amp;scene=21#wechat_redirect"
fmt.Println(html.UnescapeString(s2))
This will output:
:~:text=La Russie a engrangé 93,qui épingle particulièrement la France
http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&mid=2658568&idx=1&sn=b50084652c901&chksm=f0cb0fabcee7d4&scene=21#wechat_redirect
You do not need to decode the link, as the encoded form is the valid one. You must use the encoded form; the receiving server is the one that needs to decode it.
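If only individual decoded values are needed, rather than a decoded URL string, the parsed URL already exposes them. The following is a minimal sketch, assuming the input may be HTML-escaped; the helper name queryParam and the example.com URL are hypothetical and used only for illustration:

```go
package main

import (
	"fmt"
	"html"
	"net/url"
)

// queryParam is a hypothetical helper: it HTML-unescapes rawURL, parses it,
// and returns the percent-decoded value of the named query parameter.
func queryParam(rawURL, name string) (string, error) {
	u, err := url.Parse(html.UnescapeString(rawURL))
	if err != nil {
		return "", err
	}
	// Query() parses RawQuery and percent-decodes keys and values.
	return u.Query().Get(name), nil
}

func main() {
	s := "http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&amp;mid=2658568&amp;idx=1&amp;sn=b50084652c901&amp;chksm=f0cb0fabcee7d4&amp;scene=21#wechat_redirect"
	mid, err := queryParam(s, "mid")
	fmt.Println(mid, err) // 2658568 <nil>

	q, err := queryParam("https://example.com/search?q=La%20Russie", "q")
	fmt.Println(q, err) // La Russie <nil>
}
```

The URL itself stays in its encoded form; only the value read out of it is decoded.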
To detect whether the URL is HTML-escaped, you may check whether it contains the semicolon character ;, as it is reserved in URLs (see RFC 1738) and HTML escape sequences contain it. So decode() may look like this:
func decode(s string) string {
	if strings.IndexByte(s, ';') >= 0 {
		s = html.UnescapeString(s)
	}
	return s
}
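The semicolon check is only a heuristic: a semicolon can also appear literally in a valid URL, in which case html.UnescapeString simply finds no entity to replace. If a feed happens to escape a link more than once (for example, producing &amp;amp;), one option is to repeat the unescaping until the string stops changing. The sketch below assumes that behavior is wanted; the name unescapeFully and the pass limit are hypothetical choices, and repeated unescaping could over-decode a URL that intentionally contains entity-like text.

```go
package main

import (
	"fmt"
	"html"
)

// maxPasses bounds the loop; the value 5 is an arbitrary assumption.
const maxPasses = 5

// unescapeFully is a hypothetical helper that applies html.UnescapeString
// repeatedly until the result no longer changes or maxPasses is reached.
func unescapeFully(s string) string {
	for i := 0; i < maxPasses; i++ {
		next := html.UnescapeString(s)
		if next == s {
			return s
		}
		s = next
	}
	return s
}

func main() {
	// Double-escaped ampersand.
	fmt.Println(unescapeFully("http://foo.bar/doc?x=1&amp;amp;y=2"))
	// Literal semicolon, nothing to unescape.
	fmt.Println(unescapeFully("http://foo.bar/a;b=1"))
}
```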
If you are concerned about malicious or invalid URLs, you can parse and re-encode the URL:
func decode(s string) (string, bool) {
	if strings.IndexByte(s, ';') >= 0 {
		s = html.UnescapeString(s)
	}
	// Reject anything that is not an absolute URI or an absolute path.
	u, err := url.ParseRequestURI(s)
	if err != nil {
		return "", false
	}
	// String() re-encodes unsafe characters.
	return u.String(), true
}
Testing it:
fmt.Println(decode(`http//foo.bar`))
fmt.Println(decode(`http://foo.bar/doc?query=abc#first`))
fmt.Println(decode(`https://www.unsafe.com/a<b>c{d}e[f ]\g^`))
This will output:
false
http://foo.bar/doc?query=abc#first true
https://www.unsafe.com/a%3Cb%3Ec%7Bd%7De%5Bf%20%5D%5Cg%5E true
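One caveat: url.ParseRequestURI is documented to assume that its input has no #fragment suffix, so a fragment is not parsed into URL.Fragment. If fragments should be recognized as such, url.Parse can be used instead. Because url.Parse also accepts relative references, the sketch below additionally requires a scheme and a host. The function name normalize is hypothetical, and this is only one possible variant of the approach above.

```go
package main

import (
	"fmt"
	"html"
	"net/url"
)

// normalize is a hypothetical variant of decode: it HTML-unescapes s,
// parses it with url.Parse (which understands fragments), rejects
// references without a scheme or host, and returns the re-encoded URL.
func normalize(s string) (string, bool) {
	u, err := url.Parse(html.UnescapeString(s))
	if err != nil || u.Scheme == "" || u.Host == "" {
		return "", false
	}
	return u.String(), true
}

func main() {
	inputs := []string{
		`http//foo.bar`, // no scheme: rejected
		`http://foo.bar/doc?query=abc#first`,
		`http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&amp;mid=2658568#wechat_redirect`,
	}
	for _, in := range inputs {
		out, ok := normalize(in)
		fmt.Printf("%q %v\n", out, ok)
	}
}
```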