Is it possible to crawl a CSR website using gocolly?
Question
Is it possible to crawl CSR (client-side rendered, JavaScript-driven) websites using gocolly? I need to crawl many websites, and for that I store a titleXpath in the database and use it as follows:
    c.OnXML(titleXpath, func(e *colly.XMLElement) {
        data = append(data, e.Text)
        fmt.Println("title", e.Text)
    })
Can gocolly do this, or do I need another package?
Answer 1
Score: 2
No, not with gocolly alone. gocolly is a scraping library for Go that operates at the HTTP level: it downloads the response and parses the static HTML document, but it does not execute JavaScript, so content that a page builds in the browser never appears in what gocolly sees.
To scrape CSR websites, you need a headless browser, or a scraping tool that renders JavaScript before you query the page. Common options from Go include:
- chromedp, a Go library that drives Chrome or Chromium over the Chrome DevTools Protocol (the role Puppeteer plays in Node.js)
- Selenium WebDriver, through a Go client library such as goselenium