March 13, 2016, 04:34:37 · go

Wrong URL parsing with url.ResolveReference() if http:// is missing

Question


I've built a web crawler that finds all links on a page, follows them, and repeats until the whole site has been crawled. It worked perfectly until I came across one particular site.

The problem is how that site writes its links:

Normal case 1: absolute URL like 'http://www.example.com/test'

Normal case 2: relative path like '/test'

Problematic new case: absolute URL without the http://, just 'www.example.com'

Example code that shows the problem:

package main

import (
	"fmt"
	"log"
	"net/url"
)

func main() {
	// Normal case: the reference carries its own scheme.
	u, err := url.Parse("http://www.example.com")
	if err != nil {
		log.Fatal(err)
	}
	base, err := url.Parse("http://example.com/directory/")
	if err != nil {
		log.Fatal(err)
	}

	// Problematic case: the same host, but without "http://".
	u2, err := url.Parse("www.example.com")
	if err != nil {
		log.Fatal(err)
	}
	base2, err := url.Parse("http://example.com/directory/")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(base.ResolveReference(u))
	fmt.Println(base2.ResolveReference(u2))
}

Output:

http://www.example.com
http://example.com/directory/www.example.com

As you can see, the second line is a wrong URL: without the http://, the reference is not recognized as absolute (u2.IsAbs() returns false), so it is resolved as a relative path.
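To make the cause visible, here is a small sketch that prints how net/url splits each string. The helper name describe is made up for this illustration; the fields and methods it reads (Scheme, Host, Path, IsAbs) are the standard ones on url.URL.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
)

// describe is a hypothetical helper: it parses raw and reports how
// net/url split it into scheme, host and path.
func describe(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		log.Fatal(err)
	}
	return fmt.Sprintf("scheme=%q host=%q path=%q abs=%t",
		u.Scheme, u.Host, u.Path, u.IsAbs())
}

func main() {
	fmt.Println(describe("http://www.example.com"))
	// Without a scheme, the whole string ends up in Path, not Host.
	fmt.Println(describe("www.example.com"))
}
```

Without the scheme, the parser has nothing to mark "www.example.com" as a host, so it is kept as a relative path and ResolveReference appends it to the base directory.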

Any ideas how to fix that? I have to check 100,000 to 1,000,000 links a day, maybe more, so it needs to be performant.

Answer 1

Score: 1


Unfortunately there's no real "fix" for this, because if you get a link with an href like this:

www.example.com

In the general case it's ambiguous between:

http://host.tld/path/to/www.example.com
http://www.example.com

In fact, most browsers treat a link like this:

<a href="www.example.com">

As this:

<a href="/current/path/www.example.com">

I'd suggest doing the same (since this is a bug with the person's website), and if you get a 404 just treat it as you would any other.
