Ignore external links in Go web crawler
Question
I'm really new to Go, and I'm playing with it at the moment by building a simple web crawler following this tutorial: <https://jdanger.com/build-a-web-crawler-in-go.html>
It's broken down really nicely, but I want to put something in place so that the only links which are enqueued are part of the main domain, and not external.
So let's say I'm crawling https://www.mywebsite.com. I only want to include things like https://www.mywebsite.com/about-us or https://www.mywebsite.com/contact. I don't want subdomains, such as https://subdomain.mywebsite.com, or external links, such as https://www.facebook.com, as I do not want the crawler to fall into a black hole.
Looking at the code, I think I need to make the change to this function which fixes relative links:
    func fixUrl(href, base string) string {
        // Given a relative link and the page on which it's found, we can
        // parse them both and use the url package's ResolveReference
        // function to figure out where the link really points.
        // If it's not a relative link this is a no-op.
        uri, err := url.Parse(href)
        if err != nil {
            return ""
        }
        baseUrl, err := url.Parse(base)
        if err != nil {
            return ""
        }
        uri = baseUrl.ResolveReference(uri)
        // We work with parsed url objects in this func but we return a plain string.
        return uri.String()
    }
However, I'm not 100% sure how to do that. I'm assuming some sort of if/else or further parsing is required.
Any tips would be hugely appreciated for my learning.
Answer 1
Score: 1
I quickly read the jdanger tutorial and ran the complete example. No doubt there are a few ways to accomplish what you want to do, but here's my take.
You basically want to not enqueue any URL whose domain doesn't match some specified domain, presumably provided as a command line arg. The example uses the fixUrl() function to construct full absolute URLs and also to signal invalid URLs (by returning ""). In this function, it relies on the net/url package for parsing and such, and specifically on the URL data type. URL is a struct with this definition:
    type URL struct {
        Scheme      string
        Opaque      string    // encoded opaque data
        User        *Userinfo // username and password information
        Host        string    // host or host:port
        Path        string    // path (relative paths may omit leading slash)
        RawPath     string    // encoded path hint (see EscapedPath method); added in Go 1.5
        ForceQuery  bool      // append a query ('?') even if RawQuery is empty; added in Go 1.7
        RawQuery    string    // encoded query values, without '?'
        Fragment    string    // fragment for references, without '#'
        RawFragment string    // encoded fragment hint (see EscapedFragment method); added in Go 1.15
    }
The one to take note of is Host. Host is the 'whatever.com' part of a URL, including any subdomains and the port (see this Wikipedia article for more info). Reading further in the documentation, there is a method Hostname() which will strip the port, if present.
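To make the difference between Host and Hostname() concrete, here is a small sketch. The helper name hostParts and the port 8080 are made up for illustration; only url.Parse, the Host field, and the Hostname() method come from net/url.

```go
package main

import (
	"fmt"
	"net/url"
)

// hostParts is a hypothetical helper (not part of the tutorial) that returns
// both the Host field and the Hostname() result for a raw URL string.
func hostParts(raw string) (host, hostname string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	// Host may carry a ":port" suffix; Hostname() removes it.
	return u.Host, u.Hostname(), nil
}

func main() {
	urls := []string{
		"https://www.mywebsite.com/about-us",
		"https://www.mywebsite.com:8080/contact", // port chosen only for illustration
		"https://subdomain.mywebsite.com",
	}
	for _, raw := range urls {
		host, hostname, err := hostParts(raw)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("Host=%q Hostname=%q\n", host, hostname)
	}
}
```

Note that the subdomain is part of both values, so comparing against www.mywebsite.com will not match subdomain.mywebsite.com.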
So, although you could add domain filtering to fixUrl(), a better design, in my opinion, would be to 'fix' the URL first, then do an additional check on the result to see whether its Host matches the desired domain. If it does not match, do not enqueue the URL and continue to the next item in the queue.
So, basically I think you are on the right track. I haven't included a code example to encourage you to work it out yourself, though I did add your feature to my local copy of the tutorial's program.
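For anyone who would still like a starting point, the check described above might look roughly like the sketch below. It is not the code from my local copy or from the tutorial: sameHost and the allowed variable are hypothetical names, and the allowed host is hard-coded here where a real crawler would presumably read it from a command line arg. It assumes the link has already been through fixUrl(), so it is either an absolute URL or "".

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// sameHost reports whether link points at exactly allowedHost.
// It is a hypothetical helper, not part of the tutorial's code.
func sameHost(link, allowedHost string) bool {
	u, err := url.Parse(link)
	if err != nil {
		return false
	}
	// Hostname() drops any port. Host names are case-insensitive,
	// so compare with EqualFold rather than ==.
	return strings.EqualFold(u.Hostname(), allowedHost)
}

func main() {
	allowed := "www.mywebsite.com" // assumption: would come from a command line arg

	links := []string{
		"https://www.mywebsite.com/about-us",
		"https://www.mywebsite.com/contact",
		"https://subdomain.mywebsite.com",
		"https://www.facebook.com",
		"", // what fixUrl() returns for an invalid link
	}
	for _, link := range links {
		if !sameHost(link, allowed) {
			fmt.Printf("skip    %q\n", link)
			continue // do not enqueue; move on to the next link
		}
		fmt.Printf("enqueue %q\n", link)
	}
}
```

Because the comparison is on the whole host name, subdomains are rejected along with external sites, which is what the question asks for. Allowing subdomains would need a different rule, for example a suffix match on the registered domain.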