2021-12-26 05:54:08 · go


Colly difference between Request.Visit and collector.Visit

Question


I have written a colly script to collect port authority information from a site.

package main

import (
	"fmt"
	"time"

	// Assumption: Colly v2 import paths; for v1 drop the "/v2" suffix.
	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/extensions"
)

func main() {
	// Temp variables
	var tcountry, tport string

	// Colly collector
	c := colly.NewCollector()

	// Ignore robots.txt
	c.IgnoreRobotsTxt = true
	// Time out after 20 seconds
	c.SetRequestTimeout(20 * time.Second)
	// Use a random User-Agent for each request
	extensions.RandomUserAgent(c)

	// Set limits on Colly's operation
	c.Limit(&colly.LimitRule{
		// Filter domains affected by this rule
		DomainGlob: "searates.com/*",
		// Set a delay between requests to these domains
		Delay: 1 * time.Second,
		// Add an additional random delay
		RandomDelay: 3 * time.Second,
	})

	// Find and visit all country links
	c.OnHTML("#clist", func(e *colly.HTMLElement) {
		e.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tcountry = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Country: ", tcountry, link)
			e.Request.Visit(link)
		})
	})

	// Find and visit all port links
	c.OnHTML("#plist", func(h *colly.HTMLElement) {
		h.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tport = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Port: ", tport, link)
			h.Request.Visit(link)
		})
	})

	// Extract the port authority from each port info page
	c.OnHTML("div.row", func(e *colly.HTMLElement) {
		portAuth := e.ChildText("table#port_det tbody:nth-child(1) tr:nth-child(2) td:nth-child(2)")
		fmt.Println("Port Authority: ", portAuth)
	})

	c.Visit("https://www.searates.com/maritime/")
}

I have two questions:

1. I am more or less forced to use e.Request.Visit, because d.Visit (where d is a clone of c) never gets executed. When I cloned c as d and used d for the 'port info' part, the whole block was skipped. What am I doing wrong, and why does this happen?
2. In the code as it stands, fmt.Println("Port Authority: ", portAuth) gets executed twice. I get the output below:

❯ go run .
Country:  Albania /maritime/albania
Port:  Durres /port/durres_al
Port Authority:  Durres Port Authority
Port Authority:  
Port:  Sarande /port/sarande_al
Port Authority:  Sarande Port Authority
Port Authority:  
Port:  Shengjin /port/shengjin_al
Port Authority:  Shengjin Port Authority
Port Authority:

Again, I fail to understand why it is printed twice. Kindly help.

Answer 1

Score: 1


From the Colly package documentation:

collector.Visit -
Visit starts Collector's collecting job by creating a request to the URL specified in parameter. Visit also calls the previously provided callbacks.

Request.Visit -
Visit continues Collector's collecting job by creating a request and preserves the Context of the previous request. Visit also calls the previously provided callbacks.

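The difference, then, is the depth argument and the context. collector.Visit always starts at depth 1 with a fresh context, even when it is called inside an event handler, whereas Request.Visit uses the parent request's depth plus one and reuses its context.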

Here are the invocation differences:

collector.Visit:

if c.CheckHead {
	if check := c.scrape(URL, "HEAD", 1, nil, nil, nil, true); check != nil {
		return check
	}
}
return c.scrape(URL, "GET", 1, nil, nil, nil, true)

Request.Visit:

return r.collector.scrape(r.AbsoluteURL(URL), "GET", r.Depth+1, nil, r.Ctx, nil, true)
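
The two call sites above differ only in the depth argument, the context argument, and the URL resolution. As a rough illustration of the first two, here is a toy model that uses only the standard library. It is not Colly code: the toyRequest type and both functions are hypothetical stand-ins that mirror the arguments passed to scrape.

```go
package main

import "fmt"

// toyRequest is a hypothetical stand-in for a crawl request; it is not a Colly type.
type toyRequest struct {
	URL   string
	Depth int
	Ctx   map[string]string
}

// collectorVisit models collector.Visit: depth is fixed at 1 and no context
// is passed in, so the request starts with a fresh, empty one.
func collectorVisit(url string) toyRequest {
	return toyRequest{URL: url, Depth: 1, Ctx: map[string]string{}}
}

// requestVisit models Request.Visit: depth is the parent's depth plus one
// and the parent's context is reused.
func (r toyRequest) requestVisit(url string) toyRequest {
	return toyRequest{URL: url, Depth: r.Depth + 1, Ctx: r.Ctx}
}

func main() {
	root := collectorVisit("/maritime/")
	root.Ctx["country"] = "Albania"

	viaRequest := root.requestVisit("/port/durres_al")
	viaCollector := collectorVisit("/port/durres_al")

	fmt.Printf("depth=%d country=%q\n", viaRequest.Depth, viaRequest.Ctx["country"])
	fmt.Printf("depth=%d country=%q\n", viaCollector.Depth, viaCollector.Ctx["country"])
}
```

In this model the request made through the parent reports depth 2 and still sees the stored country, while the one made through the collector reports depth 1 and an empty value. In a real crawl this would matter if you rely on a depth limit or pass data between handlers through the request context.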

To address your questions specifically: to invoke the cloned collector d, you need to call d.Visit from within a c.OnHTML event handler. See the Coursera example. You also need to pass an absolute URL, because collector.Visit, unlike Request.Visit, does not resolve a relative link against the page it was found on. Here it is all put together:

package main

import (
	"fmt"
	"time"

	// Assumption: Colly v2 import paths; for v1 drop the "/v2" suffix.
	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/extensions"
)

func main() {
	// Temp variables
	var tcountry, tport string

	// Colly collector
	c := colly.NewCollector()

	// Ignore robots.txt
	c.IgnoreRobotsTxt = true
	// Time out after 20 seconds
	c.SetRequestTimeout(20 * time.Second)
	// Use a random User-Agent for each request
	extensions.RandomUserAgent(c)

	// Set limits on Colly's operation
	c.Limit(&colly.LimitRule{
		// DomainGlob is matched against the host only, so the original
		// pattern "searates.com/*" (with a slash) would never match.
		DomainGlob: "*searates.com",
		// Set a delay between requests to these domains
		Delay: 1 * time.Second,
		// Add an additional random delay
		RandomDelay: 3 * time.Second,
	})

	// Clone before registering callbacks: d shares c's configuration
	// but starts without any handlers.
	d := c.Clone()

	// Find and visit all country links
	c.OnHTML("#clist", func(e *colly.HTMLElement) {
		e.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tcountry = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Country: ", tcountry, link)
			e.Request.Visit(link)
		})
	})

	// Find all port links and hand them to the cloned collector
	c.OnHTML("#plist", func(h *colly.HTMLElement) {
		h.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tport = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Port: ", tport, link)
			// d.Visit does not resolve relative links, so resolve the link first.
			absoluteURL := h.Request.AbsoluteURL(link)
			d.Visit(absoluteURL)
		})
	})

	// Extract the port authority from each port info page
	d.OnHTML("div.row", func(e *colly.HTMLElement) {
		portAuth := e.ChildText("table#port_det tbody:nth-child(1) tr:nth-child(2) td:nth-child(2)")
		// The handler runs once per div.row; skip the matches without the table.
		if len(portAuth) > 0 {
			fmt.Println("Port Authority: ", portAuth)
		}
	})

	c.Visit("https://www.searates.com/maritime/")
}

Note that the absolute URL is needed because collector.Visit, unlike Request.Visit, does not resolve the link against the current page. The cloned collector therefore cannot follow a relative link such as /port/durres_al on its own.
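
To make the resolution step concrete, here is a minimal sketch of turning a relative link into an absolute one with the standard library's net/url package. The resolve helper is a hypothetical name used for illustration, and the page URL is assumed from the output in the question; this is not Colly's own implementation of AbsoluteURL.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve interprets link relative to the page it was found on.
// It is a hypothetical helper, not part of Colly.
func resolve(pageURL, link string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(link)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	abs, err := resolve("https://www.searates.com/maritime/albania", "/port/durres_al")
	if err != nil {
		fmt.Println("resolve failed:", err)
		return
	}
	fmt.Println(abs) // https://www.searates.com/port/durres_al
}
```

A collector that receives only "/port/durres_al" has no page URL to resolve it against, which is why the link has to be made absolute before it is passed to d.Visit.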

As for the second question, the line is printed twice because the page contains two div.row elements, so the handler runs once for each of them. I tried several CSS selectors to restrict the handler to the first div.row, but it is simpler to print only when the extracted string is non-empty.
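
The effect of that check can be sketched without any network access. In the snippet below, reportAuthority is a hypothetical helper standing in for the handler body, and the perMatch slice is an assumed example of what ChildText might return for the two div.row matches on one port page.

```go
package main

import (
	"fmt"
	"strings"
)

// reportAuthority stands in for the handler body: it is called once per
// matched element and produces a line only when the text is non-blank.
func reportAuthority(text string) (string, bool) {
	text = strings.TrimSpace(text)
	if text == "" {
		return "", false
	}
	return "Port Authority: " + text, true
}

func main() {
	// Assumed values: one match contains the table, the other does not.
	perMatch := []string{"Durres Port Authority", ""}
	for _, text := range perMatch {
		if line, ok := reportAuthority(text); ok {
			fmt.Println(line)
		}
	}
}
```

Trimming whitespace before the comparison is a small addition to the length check in the answer; it also skips matches whose text consists only of spaces or newlines.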
