Colly difference between Request.Visit and collector.Visit

Question

I have written a colly script to collect port authority information from a site.

func main() {
	// Temp variables
	var tcountry, tport string

	// Colly collector
	c := colly.NewCollector()

	// Ignore robots.txt
	c.IgnoreRobotsTxt = true
	// Time out after 20 seconds
	c.SetRequestTimeout(20 * time.Second)
	// Use random user agents during requests
	extensions.RandomUserAgent(c)

	// Set limits on colly operation
	c.Limit(&colly.LimitRule{
		// Filter domains affected by this rule
		DomainGlob: "searates.com/*",
		// Set a delay between requests to these domains
		Delay: 1 * time.Second,
		// Add an additional random delay
		RandomDelay: 3 * time.Second,
	})

	// Find and visit all country links
	c.OnHTML("#clist", func(e *colly.HTMLElement) {
		// fmt.Println("Country List: ", e.ChildAttrs("a", "href"))
		e.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tcountry = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Country: ", tcountry, link)
			e.Request.Visit(link)
		})
	})

	// Find and visit all port links
	c.OnHTML("#plist", func(h *colly.HTMLElement) {
		// fmt.Println("Port List: ", h.ChildAttrs("a", "href"))
		h.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tport = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Port: ", tport, link)
			h.Request.Visit(link)
		})
	})

	// Read the port authority from each port info page
	c.OnHTML("div.row", func(e *colly.HTMLElement) {
		portAuth := e.ChildText("table#port_det tbody:nth-child(1) tr:nth-child(2) td:nth-child(2)")
		fmt.Println("Port Authority: ", portAuth)
	})

	c.Visit("https://www.searates.com/maritime/")
}

I have two questions:

  1. I am more or less forced to use e.Request.Visit, because d.Visit (where d is a clone of c) does not get executed. When I cloned c as d and used d for the 'port info' part, the whole block was skipped. What am I doing wrong here, and why does this happen?

  2. In the current code, fmt.Println("Port Authority: ", portAuth) gets executed twice. I get the output below:

❯ go run .
Country:  Albania /maritime/albania
Port:  Durres /port/durres_al
Port Authority:  Durres Port Authority
Port Authority:  
Port:  Sarande /port/sarande_al
Port Authority:  Sarande Port Authority
Port Authority:  
Port:  Shengjin /port/shengjin_al
Port Authority:  Shengjin Port Authority
Port Authority:  

Again, I fail to understand why it is printed twice. Kindly help.

Answer 1

Score: 1

From the Go documentation:

collector.Visit -
Visit starts Collector's collecting job by creating a request to the URL specified in parameter. Visit also calls the previously provided callbacks

Request.Visit -
Visit continues Collector's collecting job by creating a request and preserves the Context of the previous request. Visit also calls the previously provided callbacks.

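The difference, then, is the depth argument and the context. collector.Visit always starts at depth 1 with no inherited context, even when it is called inside an event handler. Request.Visit uses the parent request's depth plus one and passes the parent's context along; it also resolves the URL against the current page via AbsoluteURL.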

Here are the invocation differences:

collector.Visit:

if c.CheckHead {
	if check := c.scrape(URL, "HEAD", 1, nil, nil, nil, true); check != nil {
		return check
	}
}
return c.scrape(URL, "GET", 1, nil, nil, nil, true)

Request.Visit:

return r.collector.scrape(r.AbsoluteURL(URL), "GET", r.Depth+1, nil, r.Ctx, nil, true)
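The effect of those two argument lists can be illustrated with a toy model. This is only a sketch: the visit struct and the collectorVisit and requestVisit functions are hypothetical stand-ins written for illustration, not colly's API, and a plain map stands in for colly's context.

```go
package main

import "fmt"

// visit records what a hypothetical scrape call would receive.
type visit struct {
	url   string
	depth int
	ctx   map[string]string // stand-in for colly's request context
}

// collectorVisit mimics collector.Visit: depth is fixed at 1 and
// nothing is inherited, so the visit starts with a fresh context.
func collectorVisit(url string) visit {
	return visit{url: url, depth: 1, ctx: map[string]string{}}
}

// requestVisit mimics Request.Visit: depth grows by one and the
// parent's context is reused.
func requestVisit(parent visit, url string) visit {
	return visit{url: url, depth: parent.depth + 1, ctx: parent.ctx}
}

func main() {
	root := collectorVisit("https://www.searates.com/maritime/")
	root.ctx["country"] = "Albania"

	viaRequest := requestVisit(root, "https://www.searates.com/maritime/albania")
	viaCollector := collectorVisit("https://www.searates.com/maritime/albania")

	// The request-based visit sees depth 2 and the stored country.
	fmt.Println(viaRequest.depth, viaRequest.ctx["country"])
	// The collector-based visit is back at depth 1 with an empty context.
	fmt.Println(viaCollector.depth, viaCollector.ctx["country"])
}
```

In practice this matters if you rely on a depth limit or on values stored in the request context: both are lost when a handler calls collector.Visit instead of Request.Visit.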

To address your questions specifically: to invoke the cloned collector d, you need to call d.Visit from within a c.OnHTML event handler, and the handler for the port info page must be registered on d, since Clone copies the collector's settings but not its callbacks. See the coursera example. You also need to pass an absolute URL, because the cloned collector has no context for the link (e.g. if it is relative). Here it is all put together:

package main

import (
	"fmt"
	"time"

	// Import paths assume colly v2; for v1 drop the "/v2" segment.
	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/extensions"
)

func main() {
	// Temp variables
	var tcountry, tport string

	// Colly collector
	c := colly.NewCollector()

	// Ignore robots.txt
	c.IgnoreRobotsTxt = true
	// Time out after 20 seconds
	c.SetRequestTimeout(20 * time.Second)
	// Use random user agents during requests
	extensions.RandomUserAgent(c)

	// Set limits on colly operation
	c.Limit(&colly.LimitRule{
		// DomainGlob is matched against the request host, so it must not
		// contain a path; the original "searates.com/*" never matched.
		DomainGlob: "*searates.com",
		// Set a delay between requests to these domains
		Delay: 1 * time.Second,
		// Add an additional random delay
		RandomDelay: 3 * time.Second,
	})

	// Clone c for the port info pages; callbacks are not copied
	d := c.Clone()

	// Find and visit all country links
	c.OnHTML("#clist", func(e *colly.HTMLElement) {
		e.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tcountry = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Country: ", tcountry, link)
			e.Request.Visit(link)
		})
	})

	// Find all port links and hand them to the cloned collector
	c.OnHTML("#plist", func(h *colly.HTMLElement) {
		h.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tport = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Port: ", tport, link)
			absoluteURL := h.Request.AbsoluteURL(link)
			d.Visit(absoluteURL)
		})
	})

	// Read the port authority from each port info page
	d.OnHTML("div.row", func(e *colly.HTMLElement) {
		portAuth := e.ChildText("table#port_det tbody:nth-child(1) tr:nth-child(2) td:nth-child(2)")
		// Skip the div.row elements that do not contain the table
		if len(portAuth) > 0 {
			fmt.Println("Port Authority: ", portAuth)
		}
	})

	c.Visit("https://www.searates.com/maritime/")
}

Note the use of the absolute URL: collector.Visit, unlike Request.Visit, does not resolve a relative link against the page it was found on, so the cloned collector cannot navigate to a relative URL.
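What resolving a relative link against its page amounts to can be sketched with the standard library alone. The resolve helper below is a hypothetical name used for illustration; it shows the general idea with net/url and is not a copy of colly's AbsoluteURL implementation.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve returns link resolved against the URL of the page it was found on.
func resolve(pageURL, link string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(link)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	abs, err := resolve("https://www.searates.com/maritime/albania", "/port/durres_al")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(abs) // https://www.searates.com/port/durres_al
}
```

A bare "/port/durres_al" carries no scheme or host of its own, which is why it has to be made absolute before it is handed to a collector that knows nothing about the page it came from.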

Regarding the second question of why it's printed twice: there are 2 div.row elements on the given page, and the handler runs once for each of them. I tried various CSS selectors to apply the event handler to only the first div.row, but it's easier to just check that the string length is greater than 0.
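The guard can be shown in isolation. This is a minimal sketch with a hypothetical nonEmpty helper: the slice stands in for the text extracted from each div.row on one port page, in the order seen in the question's output. It trims whitespace as well, which is slightly stricter than the plain length check.

```go
package main

import (
	"fmt"
	"strings"
)

// nonEmpty keeps only the values that still contain text after
// surrounding whitespace is trimmed.
func nonEmpty(values []string) []string {
	var kept []string
	for _, v := range values {
		if strings.TrimSpace(v) != "" {
			kept = append(kept, v)
		}
	}
	return kept
}

func main() {
	// One entry per div.row: the first holds the table, the second does not.
	matches := []string{"Durres Port Authority", ""}
	for _, portAuth := range nonEmpty(matches) {
		fmt.Println("Port Authority: ", portAuth)
	}
}
```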

huangapple
  • Posted on 2021-12-26 05:54:08
  • When reposting, please keep the link to this article: https://go.coder-hub.com/70482959.html