How to choose a good scraper based on content type?
Question
I want to choose a scraper based on a given URL.
The problem is that BeautifulSoup sometimes cannot scrape pages whose content is protected or rendered by JavaScript.
In that situation I use Selenium instead of bs4.
I would like to know how to detect the content type of a website and automatically choose the right scraper based on it.
Please suggest an approach for this task.
Answer 1
Score: 1
To choose an efficient scraping tool, you need to take a closer look at the HTML DOM as rendered by the browsing client.
Solution
If you have to scrape pages with static elements, BeautifulSoup provides the needed precision and performance.
But if the elements on the page are generated dynamically through any of the following:
- JavaScript
- AngularJS
- ReactJS
- Vue.js
- Ember.js
- React Native
- AJAX
- jQuery
then you need to let the dynamic components render within the DOM tree. In those cases, there is no better approach than using Selenium.
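The static-versus-dynamic distinction can be turned into a rough automatic check. The following is only a minimal sketch, not a definitive method: it assumes that a page whose raw HTML contains `<script>` tags but very little visible text is probably rendered client-side. The names `choose_scraper` and `scrape`, and the threshold of 200 characters, are hypothetical and would need tuning for real sites.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts visible text characters and <script> tags in raw HTML."""

    def __init__(self):
        super().__init__()
        self.visible_chars = 0
        self.script_tags = 0
        self._skip_depth = 0  # > 0 while inside <script>, <style> or <noscript>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style", "noscript"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style", "noscript") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only text outside script/style/noscript counts as visible.
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def choose_scraper(html, min_visible_chars=200):
    """Heuristic (an assumption, not a rule): scripts plus almost no text
    in the raw HTML suggests the content is rendered by JavaScript."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    if counter.script_tags > 0 and counter.visible_chars < min_visible_chars:
        return "selenium"
    return "beautifulsoup"


def scrape(url):
    """Fetch the page cheaply first; fall back to a real browser if needed."""
    # Third-party imports are kept local so the heuristic above stays stdlib-only.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get(url, timeout=10).text
    if choose_scraper(html) == "selenium":
        from selenium import webdriver

        driver = webdriver.Chrome()
        try:
            driver.get(url)
            # Slow pages may additionally need an explicit wait here.
            html = driver.page_source
        finally:
            driver.quit()
    return BeautifulSoup(html, "html.parser")
```

A more reliable variant, if you know which element you need, is to parse the raw HTML with BeautifulSoup first and switch to Selenium only when that element is missing.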