Colly difference between Request.Visit and collector.Visit


I have written a colly script to collect port authority information from a site.

package main

import (
	"fmt"
	"time"

	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/extensions"
)

func main() {
	// Temp variables
	var tcountry, tport string

	// Colly collector
	c := colly.NewCollector()

	// Ignore robots.txt
	c.IgnoreRobotsTxt = true
	// Time out after 20 seconds.
	c.SetRequestTimeout(20 * time.Second)
	// Use random user agents during requests
	extensions.RandomUserAgent(c)

	// Set limits on colly's operation
	c.Limit(&colly.LimitRule{
		// Filter domains affected by this rule
		DomainGlob: "*",
		// Set a delay between requests to these domains
		Delay: 1 * time.Second,
		// Add an additional random delay
		RandomDelay: 3 * time.Second,
	})

	// Find and visit all country links
	c.OnHTML("#clist", func(e *colly.HTMLElement) {
		// fmt.Println("Country List: ", e.ChildAttrs("a", "href"))
		e.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tcountry = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Country: ", tcountry, link)
			e.Request.Visit(link)
		})
	})

	// Find and visit all port links
	c.OnHTML("#plist", func(h *colly.HTMLElement) {
		// fmt.Println("Port List: ", h.ChildAttrs("a", "href"))
		h.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tport = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Port: ", tport, link)
			h.Request.Visit(link)
		})
	})

	// Find and visit all port info pages
	c.OnHTML("div.row", func(e *colly.HTMLElement) {
		portAuth := e.ChildText("table#port_det tbody:nth-child(1) tr:nth-child(2) td:nth-child(2)")
		fmt.Println("Port Authority: ", portAuth)
	})

	// The start URL is not shown in the post; startURL is a placeholder.
	startURL := "https://example.com/"
	c.Visit(startURL)
}

I have two questions below:

  1. I am more or less forced to use e.Request.Visit, because d.Visit (where d is a clone of c) never gets executed. When I cloned c as d and used d for the 'port info' part, that whole block was skipped. What am I doing wrong, and why does this happen?

  2. In the code as it stands, fmt.Println("Port Authority: ", portAuth) gets executed twice. I get the following output:

❯ go run .
Country:  Albania /maritime/albania
Port:  Durres /port/durres_al
Port Authority:  Durres Port Authority
Port Authority:  
Port:  Sarande /port/sarande_al
Port Authority:  Sarande Port Authority
Port Authority:  
Port:  Shengjin /port/shengjin_al
Port Authority:  Shengjin Port Authority
Port Authority:  

Again, I fail to understand why it is printed twice. Kindly help.


Score: 1



From the Go documentation:

collector.Visit -
Visit starts Collector's collecting job by creating a request to the URL specified in parameter. Visit also calls the previously provided callbacks

Request.Visit -
Visit continues Collector's collecting job by creating a request and preserves the Context of the previous request. Visit also calls the previously provided callbacks.

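The difference, then, is the depth parameter and the context. collector.Visit always starts at depth 1 with no context, so if you call it inside an event handler the depth is always 1. Request.Visit uses the previous request's depth plus one, passes along its Ctx, and resolves the URL with AbsoluteURL first.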

Here are the invocation differences:


func (c *Collector) Visit(URL string) error {
	if c.CheckHead {
		if check := c.scrape(URL, "HEAD", 1, nil, nil, nil, true); check != nil {
			return check
		}
	}
	return c.scrape(URL, "GET", 1, nil, nil, nil, true)
}

func (r *Request) Visit(URL string) error {
	return r.collector.scrape(r.AbsoluteURL(URL), "GET", r.Depth+1, nil, r.Ctx, nil, true)
}
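
To make the depth difference concrete, here is a toy model that uses only the standard library. It is a simplified sketch, not colly's code: the collector, request and visit types are stand-ins that track nothing but the depth argument, and the "/maritime" start path is an assumption made for illustration.

```go
package main

import "fmt"

// visit records one scraped URL and the depth it was scraped at.
type visit struct {
	url   string
	depth int
}

// collector and request are simplified stand-ins for colly's Collector and
// Request; they only model how the depth argument is chosen.
type collector struct {
	visited []visit
}

type request struct {
	collector *collector
	depth     int
}

// scrape records the visit and returns the request a callback would receive.
func (c *collector) scrape(url string, depth int) *request {
	c.visited = append(c.visited, visit{url, depth})
	return &request{collector: c, depth: depth}
}

// Visit on the collector always starts a new job at depth 1.
func (c *collector) Visit(url string) *request {
	return c.scrape(url, 1)
}

// Visit on a request continues the job one level deeper.
func (r *request) Visit(url string) *request {
	return r.collector.scrape(url, r.depth+1)
}

func main() {
	c := &collector{}

	// Starting from the collector: depth 1.
	root := c.Visit("/maritime")
	// Following links through the request: depth 2, then 3.
	country := root.Visit("/maritime/albania")
	country.Visit("/port/durres_al")
	// Going back through the collector resets the depth to 1.
	c.Visit("/port/durres_al")

	for _, v := range c.visited {
		fmt.Println(v.depth, v.url)
	}
}
```

In a real crawl this matters mainly when a depth limit such as MaxDepth is set: pages reached through collector.Visit inside a handler would all count as depth 1.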

Addressing your questions specifically: to invoke the cloned d, you need to trigger a d.Visit within a c.OnHTML event handler. See the coursera example. You also need to use AbsoluteURL, because the cloned collector has no context for the link (e.g. if it is relative). Here it is all put together:

package main

import (
	"fmt"
	"time"

	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/extensions"
)

func main() {
	// Temp variables
	var tcountry, tport string

	// Colly collector
	c := colly.NewCollector()

	// Ignore robots.txt
	c.IgnoreRobotsTxt = true
	// Time out after 20 seconds.
	c.SetRequestTimeout(20 * time.Second)
	// Use random user agents during requests
	extensions.RandomUserAgent(c)

	// Set limits on colly's operation
	c.Limit(&colly.LimitRule{
		// Filter domains affected by this rule
		DomainGlob: "*",
		// Set a delay between requests to these domains
		Delay: 1 * time.Second,
		// Add an additional random delay
		RandomDelay: 3 * time.Second,
	})

	// Cloned collector that handles only the port info pages
	d := c.Clone()

	// Find and visit all country links
	c.OnHTML("#clist", func(e *colly.HTMLElement) {
		// fmt.Println("Country List: ", e.ChildAttrs("a", "href"))
		e.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tcountry = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Country: ", tcountry, link)
			e.Request.Visit(link)
		})
	})

	// Find all port links and hand them to the cloned collector
	c.OnHTML("#plist", func(h *colly.HTMLElement) {
		// fmt.Println("Port List: ", h.ChildAttrs("a", "href"))
		h.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tport = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Port: ", tport, link)
			absoluteURL := h.Request.AbsoluteURL(link)
			d.Visit(absoluteURL)
		})
	})

	// Extract the port authority from each port info page
	d.OnHTML("div.row", func(e *colly.HTMLElement) {
		portAuth := e.ChildText("table#port_det tbody:nth-child(1) tr:nth-child(2) td:nth-child(2)")
		if len(portAuth) > 0 {
			fmt.Println("Port Authority: ", portAuth)
		}
	})

	// The start URL is not shown in the post; startURL is a placeholder.
	startURL := "https://example.com/"
	c.Visit(startURL)
}

Notice that the absolute URL is used. collector.Visit passes the URL to scrape as-is and has no previous request to resolve it against, so the cloned collector cannot follow a relative link on its own.
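
For reference, resolving a relative link against the page it was found on looks roughly like this with the standard library. This is a sketch of the general idea, not colly's implementation of AbsoluteURL; resolve is a hypothetical helper and example.com is a placeholder host, while the two paths are taken from the output in the question.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve turns a possibly relative link into an absolute URL, using the
// URL of the page it was found on as the base.
func resolve(pageURL, link string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(link)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	// example.com is a placeholder host, not the site from the question.
	abs, err := resolve("https://example.com/maritime/albania", "/port/durres_al")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(abs) // https://example.com/port/durres_al
}
```

A bare "/port/durres_al" carries no scheme or host, which is why it has to be resolved before it is handed to a collector that never saw the originating page.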

Regarding the second question of why it's printed twice, it's because there are 2 div.row elements on the given page. I've tried various different CSS selection methods to apply the event handler to only the first div.row, but it's easier to just add a check for the string length to be greater than 0.
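
The effect of that length check can be modelled without colly. In the sketch below, the slice stands in for the text extracted from each div.row match on one port page (values taken from the output in the question), and nonEmpty is a hypothetical helper. It trims whitespace before testing, which is slightly stricter than len(portAuth) > 0.

```go
package main

import (
	"fmt"
	"strings"
)

// nonEmpty keeps only the values that still contain text after trimming
// whitespace.
func nonEmpty(values []string) []string {
	var out []string
	for _, v := range values {
		if strings.TrimSpace(v) != "" {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	// One entry per div.row match: the first row contains the table, the
	// second does not, so the selector yields an empty string for it.
	perRow := []string{"Durres Port Authority", ""}
	for _, portAuth := range nonEmpty(perRow) {
		fmt.Println("Port Authority: ", portAuth)
	}
}
```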

  • Posted on December 26, 2021 at 05:54:08
