July 28, 2023 05:22:09

How to choose a good scraper based on content type?

Question

I want to choose a scraper based on a given URL.
The problem is that BeautifulSoup is sometimes unable to scrape JS-protected pages.
In that situation I use Selenium instead of bs4.
I want to know how to detect the content type of a website and, based on that type, choose the right scraper automatically.
Please suggest an approach for this task.

Answer 1

Score: 1

To choose an efficient scraping tool, you need to take a closer look at the HTML DOM rendered by the browsing client.

Solution

If you have to scrape pages with static elements, BeautifulSoup provides the precision and performance you need.

But if the elements on the page are dynamically generated through any of:

JavaScript
AngularJS
ReactJS
Vue.js
Ember.js
React Native
AJAX
jQuery

then you need to allow the dynamic components to render within the DOM tree. In those cases, there is no better approach than using Selenium.
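One way to automate the choice is sketched below. The HTTP Content-Type header does not help here, since static and JavaScript-built pages are both served as text/html, so this sketch looks at the raw HTML instead: if the page loads scripts but contains very little visible text, it is probably an empty shell that a browser has to fill in. The names `looks_js_rendered` and `fetch_html` and the 200-character threshold are assumptions made for illustration; they are a heuristic to tune per site, not a reliable detector.

```python
from html.parser import HTMLParser

# Tags whose text is not counted as visible page content.
_SKIPPED_TAGS = ("script", "style", "noscript", "title")


class _TextCounter(HTMLParser):
    """Counts visible text characters and <script> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.visible_chars = 0
        self.script_tags = 0
        self._skip_depth = 0  # > 0 while inside one of _SKIPPED_TAGS

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in _SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in _SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def looks_js_rendered(html, min_visible_chars=200):
    """Heuristic (assumption): scripts present but almost no visible text."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.script_tags > 0 and counter.visible_chars < min_visible_chars


def fetch_html(url):
    """Try a plain HTTP request first; fall back to Selenium if needed."""
    # Imported here so the heuristic above works without these packages.
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    if not looks_js_rendered(response.text):
        return response.text

    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes a local Chrome installation
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```

Either way the result is an HTML string, so it can be parsed the same way afterwards, for example with `BeautifulSoup(fetch_html(url), "html.parser")`. Pages that load their data after the initial page load may still need an explicit wait in the Selenium branch before `page_source` is read.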
