Extract info using pagination - Selenium, bs4 and Python


Question

I am scraping Sales Navigator. I was able to navigate to the first page, scroll 8 times, and extract all the names and titles using Selenium and Beautiful Soup. Below is the code.

import time

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

driver.get(dm)
time.sleep(5)

section = driver.find_element(By.XPATH, "//*[@id='search-results-container']")
time.sleep(5)

counter = 0

while counter < 8:  # this will scroll 8 times
    driver.execute_script(
        'arguments[0].scrollTop = arguments[0].scrollTop + arguments[0].offsetHeight;',
        section)
    counter += 1
    # pause after each scroll so the data fully loads
    time.sleep(7)  # time is in the standard library, no install needed

src2 = driver.page_source

# Now parse with Beautiful Soup
soup = BeautifulSoup(src2, 'lxml')

name_soup = soup.find_all('span', {'data-anonymize': 'person-name'})

names = []
for name in name_soup:
    names.append(name.text.strip())
    

However, there are 8 more pages, and I need to extract all of the names from them and append them to the names list as well.

Please help.


Answer 1

Score: 0

Generally, the logic I use for pagination is:

while True:
    ## page-scraping code [i.e., your current code]

    ## search for the next-page button/link
    ### if there is a next page --> click the button or follow the link
    ### no next page --> break out of the loop

If you include the link you're trying to scrape, I might be able to give you a more specific answer. For example, this is a function I often use to scrape paginated data, although it isn't meant for scrollable pages.


huangapple
  • Posted on 2023-07-27 18:00:36
  • Please keep this link when reposting: https://go.coder-hub.com/76778598.html
  • beautifulsoup
  • pagination
  • python-3.x
  • selenium-webdriver
  • web-scraping