I cannot web-scrape the Forbes top billionaires website with colly in Go

Question

package main

import (
	"encoding/csv"
	"fmt"
	"os"

	"github.com/gocolly/colly"
)

func checkError(err error) {
	if err != nil {
		panic(err)
	}
}

func main() {
	fName := "data.csv"
	file, err := os.Create(fName)
	checkError(err)
	defer file.Close()
	writer := csv.NewWriter(file)
	defer writer.Flush()
	// Restrict requests to the Forbes domains and extract the rank cell
	// from each row of the billionaires table.
	c := colly.NewCollector(colly.AllowedDomains("forbes.com", "www.forbes.com"))
	c.OnHTML(".scrolly-table tbody tr", func(e *colly.HTMLElement) {
		writer.Write([]string{
			e.ChildText(".rank .ng-binding"),
		})
	})
	c.OnError(func(_ *colly.Response, err error) {
		fmt.Println("Something went wrong:", err)
	})
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})
	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Visited", string(r.Body))
	})
	c.Visit("https://forbes.com/real-time-billionaires/")
}

This is my code. When I send the request I get the fallback page. This is the Forbes link I am trying to scrape.

I have noticed that the website uses a hash path at the end of the URL, and I cannot send a request with the same URL twice; I think this is somehow related to the scraping. Can anyone help me with this?


Answer 1

Score: 3

Check what is available when you disable JavaScript in your browser (you can do this from the developer tools). Most scrapers only get the textual representation of the page, while the browser also runs a JavaScript engine against it. If the data you are trying to scrape is populated by JavaScript, there is a very good chance that is the reason you cannot scrape it.
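
One quick way to confirm this is to fetch the page without any JavaScript engine and check whether the table markup is present in the raw HTML. The sketch below is only a diagnostic, assuming the ".scrolly-table" class from the question is what the rendered table uses; if that string is missing from the response body, the rows are injected by JavaScript and colly's OnHTML callback will never fire.

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Fetch the raw HTML exactly as a scraper sees it: no JavaScript is executed.
	req, err := http.NewRequest("GET", "https://www.forbes.com/real-time-billionaires/", nil)
	if err != nil {
		log.Fatal(err)
	}
	// A browser-like User-Agent; some sites serve a different fallback page to the default Go client.
	req.Header.Set("User-Agent", "Mozilla/5.0")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// If the class name never appears in the static markup, the table is
	// rendered client-side and colly alone cannot see it.
	fmt.Println("status:", resp.Status)
	fmt.Println("contains scrolly-table:", strings.Contains(string(body), "scrolly-table"))
}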

Answer 2

Score: 0

Colly can only be used for static scraping; chromedp can be used to scrape client-side rendered applications.
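
For example, a minimal chromedp sketch along those lines could look like the following. It is an untested starting point, not a drop-in replacement: the selectors are the ones from the question and Forbes' markup may have changed, and the timeout is an arbitrary choice.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/chromedp/chromedp"
)

func main() {
	// Start a headless browser context so the page's JavaScript actually runs.
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	// Give the browser a generous deadline to load and render the table.
	ctx, cancel = context.WithTimeout(ctx, 60*time.Second)
	defer cancel()

	var ranks []string
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://www.forbes.com/real-time-billionaires/"),
		// Wait until the client-side rendered table is visible (selector taken from the question).
		chromedp.WaitVisible(".scrolly-table tbody tr", chromedp.ByQuery),
		// Pull the rank cell of every row out of the rendered DOM.
		chromedp.Evaluate(`Array.from(document.querySelectorAll(".scrolly-table tbody tr .rank")).map(e => e.innerText)`, &ranks),
	)
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range ranks {
		fmt.Println(r)
	}
}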
