I cannot web-scrape the Forbes top billionaires website with colly in Go
Question
package main

import (
	"encoding/csv"
	"fmt"
	"os"

	"github.com/gocolly/colly"
)

func checkError(err error) {
	if err != nil {
		panic(err)
	}
}

func main() {
	// Write the scraped rows to a CSV file.
	fName := "data.csv"
	file, err := os.Create(fName)
	checkError(err)
	defer file.Close()

	writer := csv.NewWriter(file)
	defer writer.Flush()

	c := colly.NewCollector(colly.AllowedDomains("forbes.com", "www.forbes.com"))

	// Extract the rank cell from each row of the billionaires table.
	c.OnHTML(".scrolly-table tbody tr", func(e *colly.HTMLElement) {
		writer.Write([]string{
			e.ChildText(".rank .ng-binding"),
		})
	})

	c.OnError(func(_ *colly.Response, err error) {
		fmt.Println("Something went wrong:", err)
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Visited", string(r.Body))
	})

	c.Visit("https://forbes.com/real-time-billionaires/")
}
This is my code. When I send the request I get the fallback page; the Forbes page I am trying to scrape is the real-time billionaires list (the URL passed to c.Visit above).
I have noticed that the website uses a hash path at the end of the URL, and that I cannot request the same URL twice. I think this is somehow related to the scraping problem. Can anyone help me with this?
Answer 1
Score: 3
Check what is still available when you disable JavaScript in your browser (you can do this from the developer tools). Most scrapers only fetch the textual representation of the page, while a browser also runs a JavaScript engine against it. If the data you are trying to scrape is populated by JavaScript, there is a very good chance that is why you cannot scrape it.
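To check the same thing from Go rather than the browser, one option is to fetch the raw HTML with net/http and see whether the table markup the collector expects is present at all. This is a minimal sketch under assumptions: the class name scrolly-table comes from the question's selector and may not match the current Forbes markup, and the User-Agent string is only an example.

package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// Fetch the page the way a plain scraper would (no JavaScript engine)
// and check whether the expected table markup appears in the response.
func main() {
	req, err := http.NewRequest("GET", "https://www.forbes.com/real-time-billionaires/", nil)
	if err != nil {
		panic(err)
	}
	// Browser-like User-Agent; the exact string is just an example.
	req.Header.Set("User-Agent", "Mozilla/5.0")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}

	// If this prints false, the rows are rendered client-side by JavaScript
	// and a plain HTML scraper such as colly will never see them.
	fmt.Println("status:", resp.Status)
	fmt.Println("contains scrolly-table:", strings.Contains(string(body), "scrolly-table"))
}

If the class name never appears in the static response, the rows are injected by JavaScript after the page loads, which matches the "fallback page" behaviour described in the question.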
Answer 2
Score: 0
Colly can only be used for static scraping; chromedp can be used to scrape client-side rendered applications.
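For reference, a minimal chromedp sketch for this kind of page could look like the following. It drives a headless Chrome instance, so the page's JavaScript runs before the HTML is read. The selector is taken from the question and is an assumption; Forbes may have changed its markup or may block automated browsers, so treat this as a starting point rather than a working scraper.

package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	"github.com/chromedp/chromedp"
)

func main() {
	// chromedp controls a real (headless) Chrome instance, so JavaScript runs
	// and client-side rendered content becomes visible to the scraper.
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	// Guard against the page never finishing rendering.
	ctx, cancel = context.WithTimeout(ctx, 60*time.Second)
	defer cancel()

	var tableHTML string
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://www.forbes.com/real-time-billionaires/"),
		// Selector taken from the question; it may no longer match the live page.
		chromedp.WaitVisible(".scrolly-table tbody tr", chromedp.ByQuery),
		chromedp.OuterHTML(".scrolly-table", &tableHTML, chromedp.ByQuery),
	)
	if err != nil {
		panic(err)
	}

	// The rendered HTML can then be parsed (for example with goquery) to
	// extract the rank and other columns.
	fmt.Println(strings.Count(tableHTML, "<tr"), "rows rendered")
}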