Wrong URL parsing with url.ResolveReference() if http:// is missing
Question
I've built a web crawler that collects all links on a page, follows them, and searches those pages for more links until the whole site is crawled. It worked perfectly until I came across a special site.
The problem is how that site writes its links:
Normal case 1: absolute URL like 'http://www.example.com/test'
Normal case 2: relative path like '/test'
Problematic new case: absolute URL without the http://, just 'www.example.com'
Example code that shows the problem:
package main

import (
	"fmt"
	"log"
	"net/url"
)

func main() {
	// Case 1: link with a scheme, resolved against the page URL.
	u, err := url.Parse("http://www.example.com")
	if err != nil {
		log.Fatal(err)
	}
	base, err := url.Parse("http://example.com/directory/")
	if err != nil {
		log.Fatal(err)
	}

	// Case 2: the same host without "http://", parsed as a relative path.
	u2, err := url.Parse("www.example.com")
	if err != nil {
		log.Fatal(err)
	}
	base2, err := url.Parse("http://example.com/directory/")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(base.ResolveReference(u))
	fmt.Println(base2.ResolveReference(u2))
}
Output:
http://www.example.com
http://example.com/directory/www.example.com
As you can see, the second line gives back a wrong URL, because u.IsAbs() returns false when the http:// is missing.
Any ideas how to fix that? I have to test 100,000 to 1,000,000 links on a daily basis, maybe more, so it needs to be performant.
Answer 1
Score: 1
Unfortunately there's no real "fix" for this, because if you get a link with an href like this:
www.example.com
In the general case it's ambiguous between:
http://host.tld/path/to/www.example.com
http://www.example.com
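To see which of the two readings Go picks, it may help to look at how net/url splits the bare string. The following is only a small sketch; describe is a hypothetical helper name used for illustration, not part of the standard library.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
)

// describe (hypothetical helper) parses raw and reports which URL
// components net/url filled in, plus the result of IsAbs.
func describe(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("scheme=%q host=%q path=%q abs=%t",
		u.Scheme, u.Host, u.Path, u.IsAbs()), nil
}

func main() {
	for _, raw := range []string{"http://www.example.com", "www.example.com"} {
		s, err := describe(raw)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(raw, "->", s)
	}
}
```

Without a scheme, the whole string ends up in Path and Host stays empty, so ResolveReference has nothing to go on except treating it as a relative path.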
In fact, most browsers treat a link like this:
<a href="www.example.com">
As this:
<a href="/current/path/www.example.com">
I'd suggest doing the same (since this is a bug in that website's markup), and if you get a 404, just handle it as you would any other.