I cannot web-scrape the Forbes top billionaires website with colly in Go

Question

package main

import (
	"encoding/csv"
	"fmt"
	"os"

	"github.com/gocolly/colly"
)

func checkError(err error) {
	if err != nil {
		panic(err)
	}
}

func main() {
	fName := "data.csv"
	file, err := os.Create(fName)
	checkError(err)
	defer file.Close()
	writer := csv.NewWriter(file)
	defer writer.Flush()
	// Restrict requests to the Forbes domains and extract the rank cell
	// from each row of the billionaires table.
	c := colly.NewCollector(colly.AllowedDomains("forbes.com", "www.forbes.com"))
	c.OnHTML(".scrolly-table tbody tr", func(e *colly.HTMLElement) {
		writer.Write([]string{
			e.ChildText(".rank .ng-binding"),
		})
	})
	c.OnError(func(_ *colly.Response, err error) {
		fmt.Println("Something went wrong:", err)
	})
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})
	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Visited", string(r.Body))
	})
	c.Visit("https://forbes.com/real-time-billionaires/")
}

This is my code. When I send the request I get the fallback page. This is the Forbes link I am trying to scrape.

I have noticed that the website uses a hash path at the end of the URL, and I cannot send a request with the same URL twice; I think this is somehow related to the scraping. Can anyone help me with this?


Answer 1

Score: 3

Check what is available when you disable JavaScript in your browser (you can do this from the developer tools). Most scrapers only get the textual representation of the page, while the browser also runs a JavaScript engine against it. If the data you are trying to scrape is populated by JavaScript, there is a very good chance that is the reason you cannot scrape it.
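
One quick way to confirm this is to fetch the page without any JavaScript engine and check whether the table markup is present in the raw HTML. The sketch below is only a diagnostic, assuming the ".scrolly-table" class from the question is what the rendered table uses; if that string is missing from the response body, the rows are injected by JavaScript and colly's OnHTML callback will never fire.

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Fetch the raw HTML exactly as a scraper sees it: no JavaScript is executed.
	req, err := http.NewRequest("GET", "https://www.forbes.com/real-time-billionaires/", nil)
	if err != nil {
		log.Fatal(err)
	}
	// A browser-like User-Agent; some sites serve a different fallback page to the default Go client.
	req.Header.Set("User-Agent", "Mozilla/5.0")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// If the class name never appears in the static markup, the table is
	// rendered client-side and colly alone cannot see it.
	fmt.Println("status:", resp.Status)
	fmt.Println("contains scrolly-table:", strings.Contains(string(body), "scrolly-table"))
}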

Answer 2

Score: 0

Colly can only be used for static scraping; chromedp can be used to scrape client-side rendered applications.
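
For example, a minimal chromedp sketch along those lines could look like the following. It is an untested starting point, not a drop-in replacement: the selectors are the ones from the question and Forbes' markup may have changed, and the timeout is an arbitrary choice.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/chromedp/chromedp"
)

func main() {
	// Start a headless browser context so the page's JavaScript actually runs.
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	// Give the browser a generous deadline to load and render the table.
	ctx, cancel = context.WithTimeout(ctx, 60*time.Second)
	defer cancel()

	var ranks []string
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://www.forbes.com/real-time-billionaires/"),
		// Wait until the client-side rendered table is visible (selector taken from the question).
		chromedp.WaitVisible(".scrolly-table tbody tr", chromedp.ByQuery),
		// Pull the rank cell of every row out of the rendered DOM.
		chromedp.Evaluate(`Array.from(document.querySelectorAll(".scrolly-table tbody tr .rank")).map(e => e.innerText)`, &ranks),
	)
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range ranks {
		fmt.Println(r)
	}
}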
