GoLang scraper: how to scrape dynamically generated links on a website?

Question

I am trying to scrape product video links. They are generated dynamically by another web service and appear under the product images on the left side of the page. For an example, see:
https://www.tokopedia.com/chocoapple/ready-stock-bnib-iphone-128gb-7-plus-jet-black-garansi-apple-1-tahun-10?src=topads
Google Chrome's "Inspect Element" shows the div tag, but the same tag is not present in the page source.
How can I scrape it? I am looking into GoQuery for this task, but I am not sure whether it will work. I am not a web developer, so please tell me if my question is not specific enough.
Thank you.

Answer 1

Score: 3

If the tag is not in the page source, GoQuery alone will not work. GoQuery parses HTML source through a jQuery-like API; it does not execute JavaScript.

You first need to load the page in a headless browser such as PhantomJS, Chromeless, or Puppeteer. Each of these tools runs the JavaScript on the page before you parse it. That way, the AJAX request that renders the video you are interested in is carried out and the DOM is updated. You can then save the rendered HTML, which should contain the div, and parse that.
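Before reaching for a headless browser, it can help to confirm that the element really is missing from the raw source. The following is a minimal sketch using only the Go standard library; the helper names `fetchSource` and `hasMarker`, the sample HTML, and the choice of `thumbnail-img` as the marker are illustrative assumptions, not part of GoQuery or of the site in question.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

// fetchSource downloads the raw HTML exactly as the server sends it,
// before any JavaScript has run. (Hypothetical helper.)
func fetchSource(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// hasMarker reports whether the HTML contains the given substring,
// for example a class name copied from "Inspect Element".
func hasMarker(html, marker string) bool {
	return strings.Contains(html, marker)
}

func main() {
	// Hypothetical sample standing in for a downloaded page. Pass a URL as
	// the first argument to check a live page instead.
	html := `<div id="product"><script src="/app.js"></script></div>`
	if len(os.Args) > 1 {
		var err error
		if html, err = fetchSource(os.Args[1]); err != nil {
			fmt.Fprintln(os.Stderr, "fetch failed:", err)
			os.Exit(1)
		}
	}
	if hasMarker(html, "thumbnail-img") {
		fmt.Println("marker is in the static source; a plain HTML parser can find it")
	} else {
		fmt.Println("marker is missing from the static source; it is probably added by JavaScript")
	}
}
```

If the marker is missing here but visible in the browser's inspector, the element is most likely inserted by a script, and a parser that only reads the downloaded source will not see it.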

Answer 2

Score: 2

You probably need to evaluate the page the way a browser does. As schollz answered, this is possible with a so-called headless browser: a browser that is driven through the CLI or an API and does not show its GUI.

In the Go world there is chromedp:

https://github.com/knq/chromedp

https://www.youtube.com/watch?v=_7pWCg94sKw
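As a rough illustration of the CLI route, the sketch below shells out to headless Chrome with its `--dump-dom` flag, which prints the DOM after the page's scripts have run. This is not chromedp's API; it is a standard-library alternative. The binary name `google-chrome`, the 30-second timeout, and the helper names are assumptions, and content that arrives from a slow AJAX call may still be missing from the dump.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// chromeBinary is an assumption: the executable name differs per system
// (for example "google-chrome", "chromium", or "chrome").
const chromeBinary = "google-chrome"

// dumpDOMArgs builds the arguments for a headless run that prints the
// serialized DOM to stdout once the page has loaded.
func dumpDOMArgs(url string) []string {
	return []string{"--headless", "--disable-gpu", "--dump-dom", url}
}

// renderedHTML runs headless Chrome and returns the DOM it printed.
func renderedHTML(ctx context.Context, url string) (string, error) {
	out, err := exec.CommandContext(ctx, chromeBinary, dumpDOMArgs(url)...).Output()
	if err != nil {
		return "", fmt.Errorf("running %s: %w", chromeBinary, err)
	}
	return string(out), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: render <url>")
		return
	}
	// Give up if the browser has not finished within the timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	html, err := renderedHTML(ctx, os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(html)
}
```

The returned HTML can then be passed to an HTML parser. A library such as chromedp offers finer control, for instance waiting for a specific element to appear, at the cost of an extra dependency.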

Answer 3

Score: 1

Look for the following tag: <img class="thumbnail-img horizontal" src="//i.ytimg.com/vi/oKR2fh09Nic/mqdefault.jpg">. As you can see, the src contains the video ID "oKR2fh09Nic". The link you need is therefore https://www.youtube.com/watch?v=oKR2fh09Nic

You can also use http://youtube.com/get_video_info?video_id=oKR2fh09Nic to load the video information.

There is an example here: https://github.com/kkdai/youtube/blob/master/youtube.go
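The ID extraction described above can be sketched with the standard `regexp` package. The function names `videoID` and `watchURL` are hypothetical, and the pattern assumes the thumbnail path always has the form `i.ytimg.com/vi/<id>/...` as in the tag quoted in this answer.

```go
package main

import (
	"fmt"
	"regexp"
)

// thumbRe captures the ID segment of a thumbnail path such as
// //i.ytimg.com/vi/<id>/mqdefault.jpg (assumed layout).
var thumbRe = regexp.MustCompile(`i\.ytimg\.com/vi/([A-Za-z0-9_-]+)/`)

// videoID extracts the video ID from a thumbnail src attribute.
// The second result is false when the src does not match the pattern.
func videoID(src string) (string, bool) {
	m := thumbRe.FindStringSubmatch(src)
	if m == nil {
		return "", false
	}
	return m[1], true
}

// watchURL builds the watch link for a video ID.
func watchURL(id string) string {
	return "https://www.youtube.com/watch?v=" + id
}

func main() {
	// The src value quoted in the answer.
	src := "//i.ytimg.com/vi/oKR2fh09Nic/mqdefault.jpg"
	if id, ok := videoID(src); ok {
		fmt.Println(watchURL(id))
	} else {
		fmt.Println("no video ID found in", src)
	}
}
```

This only helps once the img tag is available, so it applies to the rendered HTML if the thumbnail is itself inserted by JavaScript.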

huangapple
  • Posted on August 27, 2017 at 21:39:19
  • When reposting, please keep the link to this article: https://go.coder-hub.com/45905550.html