Is it possible to crawl a CSR website using gocolly?

Question

Is it possible to crawl client-side rendered (CSR, JavaScript-driven) websites using the gocolly library? I need to crawl many websites, and for each one I have a titleXpath stored in the database, which I use as follows:

// c is a *colly.Collector; titleXpath is loaded from the database.
c.OnXML(titleXpath, func(e *colly.XMLElement) {
	data = append(data, e.Text)
	fmt.Println("title", e.Text)
})

Is this possible, or do I need another package?

Answer 1

Score: 2

No, gocolly alone cannot crawl client-side rendered (CSR/JS) websites. gocolly is a scraping library for Go that works at the HTTP level: it downloads and parses the document the server returns, but it does not execute JavaScript. Content that scripts add to the page after it loads therefore never reaches callbacks such as OnXML.
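
To illustrate why an HTTP-level client misses script-generated content, here is a minimal sketch that uses only the Go standard library rather than gocolly itself. Everything in it is an assumption made for illustration: csrPage is a made-up page whose title is set by JavaScript, newCSRServer serves it locally, and the regular expression in extractTitle merely stands in for the XPath matching that gocolly would do.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"regexp"
)

// csrPage is a hypothetical client-side rendered page: the <title> is empty
// in the markup and is only filled in by JavaScript after the page loads.
const csrPage = `<!DOCTYPE html>
<html>
<head><title></title></head>
<body>
<div id="app"></div>
<script>document.title = "Product list";</script>
</body>
</html>`

// titleRe captures whatever sits between the <title> tags in raw HTML.
var titleRe = regexp.MustCompile(`(?s)<title>(.*?)</title>`)

// extractTitle returns the text between the <title> tags, or "" if the tag
// is missing or empty. It is a stand-in for an XPath lookup.
func extractTitle(html string) string {
	m := titleRe.FindStringSubmatch(html)
	if m == nil {
		return ""
	}
	return m[1]
}

// fetchTitle downloads url over plain HTTP, without running any scripts,
// and extracts the title from the response body.
func fetchTitle(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return extractTitle(string(body)), nil
}

// newCSRServer starts a local test server that serves csrPage.
func newCSRServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html; charset=utf-8")
		io.WriteString(w, csrPage)
	}))
}

func main() {
	srv := newCSRServer()
	defer srv.Close()

	title, err := fetchTitle(srv.URL)
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	// The script never ran, so the title is still empty.
	fmt.Printf("title seen by an HTTP-only client: %q\n", title)
}
```

A browser loading the same page would show "Product list" as the title, because it runs the script. The HTTP-only client sees just the empty tag, which is the situation a titleXpath would face on a CSR site.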

To scrape CSR websites, you need a headless browser or a scraping tool that renders JavaScript before you extract data. Popular options include:

  • chromedp, a Go library that drives Chrome or Chromium over the Chrome DevTools Protocol (a role similar to Puppeteer in Node.js)
  • Selenium WebDriver, through a Go client library such as goselenium

huangapple
  • Posted on July 25, 2023, 11:43:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76759313.html