Go Colly not returning any data from website

Question

I am trying to make a simple web scraper in Go, and I can't seem to get even the most basic functionality out of colly. I took the basic example from the colly docs, and while it worked with the hackernews.org site they used, it isn't working with the site I am trying to scrape. I tried several variations of the URL (with https://, with www., with a trailing /, etc.) and nothing seems to work. I scraped the same site with Beautiful Soup in Python and got everything, so I know the site can be scraped. Any help is appreciated. Thanks.

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

// main function
func main() {
	/* instantiate colly */
	c := colly.NewCollector(
		colly.AllowedDomains("www.bjjheroes.com/"),
	)

	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		fmt.Printf("Link found: %q \n", e.Text)
	})

	c.Visit("www.bjjheroes.com/a-z-bjj-fighters-list")
}

Answer 1

Score: 3

The "error" was on my part: the allowed domains needed several more variations. After adding

		colly.AllowedDomains(
			"www.bjjheroes.com/",
			"bjjheroes.com/",
			"https://bjjheroes.com/",
			"www.bjjheroes.com",
			"bjjheroes.com",
			"https://bjjheroes.com",
		),

everything worked.
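
A possible explanation for why the longer list helps, offered as a sketch rather than a description of colly's internals: colly appears to compare each AllowedDomains entry against the hostname of the request URL, so an entry containing a scheme or a trailing slash would never match, and only the bare "www.bjjheroes.com" and "bjjheroes.com" entries would do any work. The standard-library example below illustrates that idea. The hostAllowed helper is hypothetical and is not colly's code.

```go
package main

import (
	"fmt"
	"net/url"
)

// hostAllowed reports whether the hostname of rawURL exactly equals one of
// the allowed entries. Hypothetical stand-in for an allow-list check; it
// only assumes that entries are compared against the bare hostname.
func hostAllowed(rawURL string, allowed []string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := u.Hostname() // empty when rawURL has no scheme
	for _, d := range allowed {
		if d == host {
			return true
		}
	}
	return false
}

func main() {
	// A subset of the entries from the answer above.
	allowed := []string{
		"www.bjjheroes.com/",
		"https://bjjheroes.com",
		"www.bjjheroes.com",
	}

	urls := []string{
		"https://www.bjjheroes.com/a-z-bjj-fighters-list",
		"www.bjjheroes.com/a-z-bjj-fighters-list", // no scheme, as in the question
	}

	for _, raw := range urls {
		u, err := url.Parse(raw)
		if err != nil {
			fmt.Printf("%q: parse error: %v\n", raw, err)
			continue
		}
		fmt.Printf("%q -> hostname %q, allowed: %v\n",
			raw, u.Hostname(), hostAllowed(raw, allowed))
	}
}
```

With the scheme present, the hostname parses as "www.bjjheroes.com" and matches the bare entry. Without a scheme, net/url treats the whole string as a path, so the hostname is empty and nothing matches. If the same holds for colly, the original program would likely need only colly.AllowedDomains("www.bjjheroes.com") together with c.Visit("https://www.bjjheroes.com/a-z-bjj-fighters-list"). Checking the error returned by c.Visit should show whether the request was rejected before it was sent.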

huangapple
  • Posted on 2021-12-25 17:19:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/70479051.html