Ignore external links in Go web crawler

Question

I'm really new to Go, and I'm playing with it at the moment by building a simple web crawler, following this tutorial: https://jdanger.com/build-a-web-crawler-in-go.html

It's broken down really nicely, but I want to put something in place so that the only links enqueued are part of the main domain, not external ones.

So let's say I'm crawling https://www.mywebsite.com: I only want to include links like https://www.mywebsite.com/about-us or https://www.mywebsite.com/contact. I don't want subdomains, such as https://subdomain.mywebsite.com, or external links, such as https://www.facebook.com, as I do not want the crawler to fall into a black hole.

Looking at the code, I think I need to make the change in this function, which fixes relative links:

func fixUrl(href, base string) (string) {  // given a relative link and the page on
  uri, err := url.Parse(href)              // which it's found we can parse them
  if err != nil {                          // both and use the url package's
    return ""                              // ResolveReference function to figure
  }                                        // out where the link really points.
  baseUrl, err := url.Parse(base)          // If it's not a relative link this
  if err != nil {                          // is a no-op.
    return ""
  }
  uri = baseUrl.ResolveReference(uri)
  return uri.String()                      // We work with parsed url objects in this
}                                          // func but we return a plain string.

However, I'm not 100% sure how to do that; I'm assuming some sort of if/else or further parsing is required.

Any tips would be hugely appreciated for my learning.

Answer 1

Score: 1

I quickly read the jdanger tutorial and ran the complete example. No doubt there are a few ways to accomplish what you want to do, but here's my take.

You basically want to avoid enqueuing any URL whose domain doesn't match some specified domain, presumably provided as a command-line argument. The example uses the fixUrl() function to construct full absolute URLs and also to signal invalid URLs (by returning ""). That function relies on the net/url package for parsing, and specifically on the URL data type. URL is a struct with this definition:

type URL struct {
    Scheme      string
    Opaque      string    // encoded opaque data
    User        *Userinfo // username and password information
    Host        string    // host or host:port
    Path        string    // path (relative paths may omit leading slash)
    RawPath     string    // encoded path hint (see EscapedPath method); added in Go 1.5
    ForceQuery  bool      // append a query ('?') even if RawQuery is empty; added in Go 1.7
    RawQuery    string    // encoded query values, without '?'
    Fragment    string    // fragment for references, without '#'
    RawFragment string    // encoded fragment hint (see EscapedFragment method); added in Go 1.15
}

The one to take note of is Host. Host is the 'whatever.com' part of a URL, including subdomains and the port (see the Wikipedia article on hostnames for more info). Reading further in the documentation, there is a method Hostname() which will strip the port, if present.
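
As a quick illustration of the difference (a minimal sketch; the example URL is invented):

package main

import (
    "fmt"
    "net/url"
)

func main() {
    u, err := url.Parse("https://subdomain.mywebsite.com:8080/about-us")
    if err != nil {
        panic(err)
    }
    fmt.Println(u.Host)       // "subdomain.mywebsite.com:8080" - host plus port
    fmt.Println(u.Hostname()) // "subdomain.mywebsite.com" - port stripped
}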

So, although you could add domain filtering to fixUrl(), a better design, in my opinion, would be to 'fix' the URL first, then do an additional check on the result to see if its Host matches the desired domain. If it does not match, do not enqueue the URL and continue to the next item in the queue.
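
Here's a minimal sketch of what that check could look like (sameDomain and allowedHost are placeholder names, not from the tutorial):

// sameDomain reports whether rawUrl's hostname exactly matches allowedHost,
// so subdomains and external domains are both rejected.
func sameDomain(rawUrl, allowedHost string) bool {
    u, err := url.Parse(rawUrl)
    if err != nil {
        return false
    }
    return u.Hostname() == allowedHost
}

In the crawl loop, after fixUrl() has produced an absolute URL u, you would only enqueue it when sameDomain(u, "www.mywebsite.com") returns true.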

So, basically I think you are on the right track. I'll leave the full integration for you to work out yourself, though I did add your feature to my local copy of the tutorial's program.
