June 13, 2022, 23:52:48 · go

Completely unescape & percent decode URL

Question

I am working on an RSS news parser. The hrefs I find in the content vary widely: some are HTML-escaped and some are not, some are URL-encoded and some are not:

URL-encoded:

https://www.lefigaro.fr/flash-eco/la-russie-a-gagne-93-0220613#:~:text=La%20Russie%20a%20engrang%C3%A9%2093,qui%20%C3%A9pingle%20particuli%C3%A8rement%20la%20France

HTML-escaped:

http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&amp;mid=2658568&amp;idx=1&amp;sn=b50084652c901&amp;chksm=f0cb0fabcee7d4&amp;scene=21#wechat_redirect

Neither encoded nor escaped:

https://newsquawk.com/daily/article?id=2490-us-market-open-concerns&utm_source=tradingview&utm_medium=research&utm_campaign=partner-post

Additionally, the RSS content may contain unencoded unsafe characters:

https://www.unsafe.com/a<b>c{d}e[f ]\g^

I need to make all of these URLs formally "safe". It seems the only way to get a formally safe URL is to completely unescape and decode it first?

Can I somehow normalize all these different URLs? Is there a way to get a completely unescaped and decoded URL in Go?

func(url string) (completelyDecodedUrl string, err error) {
    // ??
}

Answer 1

Score: 1

The URL-encoded example is fine as-is; that is how data is transmitted as part of a URL. If you need the decoded version of the fragment, parse the URL and read its URL.Fragment field.

As to the second, simply use html.UnescapeString().

For example:

s := "https://www.lefigaro.fr/flash-eco/la-russie-a-gagne-93-0220613#:~:text=La%20Russie%20a%20engrang%C3%A9%2093,qui%20%C3%A9pingle%20particuli%C3%A8rement%20la%20France"
u, err := url.Parse(s)
if err != nil {
	panic(err)
}

// Fragment holds the percent-decoded form of everything after '#'.
fmt.Println(u.Fragment)

// s2 is HTML-escaped: each '&' arrives as "&amp;".
s2 := "http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&amp;mid=2658568&amp;idx=1&amp;sn=b50084652c901&amp;chksm=f0cb0fabcee7d4&amp;scene=21#wechat_redirect"
fmt.Println(html.UnescapeString(s2))

This outputs:

:~:text=La Russie a engrangé 93,qui épingle particulièrement la France
http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&mid=2658568&idx=1&sn=b50084652c901&chksm=f0cb0fabcee7d4&scene=21#wechat_redirect

You do not need to decode the link, because the encoded form is the valid one. Keep using the encoded form; decoding it is the job of the receiving server.
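To illustrate the point, here is a minimal sketch that reads a decoded query value for inspection while keeping the encoded URL for actual use. The helper name queryValue, the example.com URL, and its parameter names are assumptions made for this example, not part of the original answer.

```go
package main

import (
	"fmt"
	"net/url"
)

// queryValue (hypothetical helper) parses rawURL and returns the
// percent-decoded value of one query parameter, together with the URL
// serialized back into its encoded form.
func queryValue(rawURL, key string) (value, encoded string, err error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", "", err
	}
	// Query() decodes parameter values; String() keeps them encoded.
	return u.Query().Get(key), u.String(), nil
}

func main() {
	// Placeholder URL used only for this illustration.
	raw := "https://example.com/search?q=caf%C3%A9%20au%20lait&lang=fr"
	value, encoded, err := queryValue(raw, "q")
	if err != nil {
		panic(err)
	}
	fmt.Println(value)   // decoded, useful for display or logging
	fmt.Println(encoded) // still encoded, the form to store or request
}
```

The decoded value is only for reading; anything that is stored or requested should stay in the encoded form.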

To detect whether a URL is HTML-escaped, you can check whether it contains a semicolon (;): the character is reserved in URLs (see RFC 1738), and HTML escape sequences end with one. This is only a heuristic, since a valid URL may still contain a literal semicolon. decode() may look like this:

func decode(s string) string {
	// Only unescape when a ';' hints at an HTML entity.
	if strings.IndexByte(s, ';') >= 0 {
		s = html.UnescapeString(s)
	}
	return s
}
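If a feed escapes a URL more than once (for example `&amp;amp;` in place of `&`), a single call to html.UnescapeString removes only one layer. The sketch below repeats the call until the string stops changing. The helper name unescapeFully and the maxRounds cap are assumptions introduced for illustration. Be aware that html.UnescapeString also recognizes some entity names without a trailing semicolon, so a query parameter such as `&copy=` may be rewritten; checking results against real feed data is advisable.

```go
package main

import (
	"fmt"
	"html"
)

// unescapeFully (hypothetical helper) applies html.UnescapeString
// repeatedly until the string no longer changes or maxRounds is reached.
func unescapeFully(s string, maxRounds int) string {
	for i := 0; i < maxRounds; i++ {
		next := html.UnescapeString(s)
		if next == s {
			break // nothing left to unescape
		}
		s = next
	}
	return s
}

func main() {
	// Placeholder URL with a doubly escaped ampersand.
	fmt.Println(unescapeFully("http://example.com/s?a=1&amp;amp;b=2", 5))
	// prints: http://example.com/s?a=1&b=2
}
```

The cap on rounds is a defensive choice so that unexpected input cannot keep the loop running longer than intended.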

If you are worried about malicious or invalid URLs, you can also parse and re-encode the URL:

func decode(s string) (string, bool) {
	if strings.IndexByte(s, ';') >= 0 {
		s = html.UnescapeString(s)
	}
	// Reject anything that is not an absolute URI or absolute path.
	u, err := url.ParseRequestURI(s)
	if err != nil {
		return "", false
	}
	// String() re-encodes characters that are unsafe in a URL.
	return u.String(), true
}

Testing it:

fmt.Println(decode(`http//foo.bar`))
fmt.Println(decode(`http://foo.bar/doc?query=abc#first`))
fmt.Println(decode(`https://www.unsafe.com/a<b>c{d}e[f ]\g^`))

This outputs:

 false
http://foo.bar/doc?query=abc#first true
https://www.unsafe.com/a%3Cb%3Ec%7Bd%7De%5Bf%20%5D%5Cg%5E true
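One caveat about decode(): the documentation of url.ParseRequestURI states that the input is assumed not to have a #fragment suffix, so URLs with text fragments like the lefigaro.fr example may not round-trip the way you expect. The following is a sketch of a variant that uses url.Parse, which splits off the fragment, and then rejects results that lack a scheme or host, since url.Parse also accepts relative references. The name normalize and the example.com URLs are assumptions for illustration.

```go
package main

import (
	"fmt"
	"html"
	"net/url"
	"strings"
)

// normalize (hypothetical variant of decode) HTML-unescapes s when it
// looks escaped, parses it with url.Parse so the #fragment is handled
// separately, and returns the re-encoded URL.
func normalize(s string) (string, bool) {
	if strings.Contains(s, ";") {
		s = html.UnescapeString(s)
	}
	u, err := url.Parse(s)
	// url.Parse accepts relative references, so require scheme and host.
	if err != nil || u.Scheme == "" || u.Host == "" {
		return "", false
	}
	return u.String(), true
}

func main() {
	// Placeholder inputs used only for this illustration.
	fmt.Println(normalize("http//foo.bar"))
	fmt.Println(normalize("https://example.com/a b#sec 1"))
	fmt.Println(normalize("http://example.com/s?a=1&amp;b=2#x"))
}
```

Whether a missing scheme or host should count as invalid depends on the feed; relative links could instead be resolved against the feed's base URL with URL.ResolveReference.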
