Extract info using pagination - Selenium, BS4, and Python


Question

I am scraping Sales Navigator. I was able to navigate to the first page, scroll 8 times, and extract all the names and titles using Selenium and BeautifulSoup. Below is the code.

  # Imports used by this snippet (time is part of the Python standard library,
  # so nothing extra needs to be installed for it)
  import time
  from bs4 import BeautifulSoup
  from selenium.webdriver.common.by import By

  # `driver` is an already-initialised Selenium WebDriver and `dm` is the
  # Sales Navigator search URL, both defined earlier in my script
  driver.get(dm)
  time.sleep(5)

  # The scrollable container that holds the search results
  section = driver.find_element(By.XPATH, "//*[@id='search-results-container']")
  time.sleep(5)

  counter = 0
  while counter < 8:  # this will scroll 8 times
      driver.execute_script(
          'arguments[0].scrollTop = arguments[0].scrollTop + arguments[0].offsetHeight;',
          section)
      counter += 1
      # give the lazily loaded results time to render after each scroll
      time.sleep(7)

  src2 = driver.page_source

  # Now parse the fully loaded page with BeautifulSoup
  soup = BeautifulSoup(src2, 'lxml')
  name_soup = soup.find_all('span', {'data-anonymize': 'person-name'})
  names = []
  for name in name_soup:
      names.append(name.text.strip())

However, there are 8 more pages, and I need to extract the names from those pages as well and append them to the names list.

Please help.


Answer 1

Score: 0

Generally, the logic I use for pagination is:

  while True:
      ## PAGE-SCRAPING CODE [i.e., your current code]
      ## SEARCH FOR A NEXT-PAGE BUTTON/LINK
      ### IF THERE IS A NEXT PAGE --> click the button or follow the link
      ### IF THERE IS NO NEXT PAGE --> break out of the loop

If you included the link you're trying to scrape, I might be able to give you a more specific answer. For example, this is a function I often use to scrape paginated data, although it isn't meant for scrollable pages.
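Applied to the code in the question, that loop might look roughly like the sketch below. This is only a minimal sketch: the `button[aria-label="Next"]` selector is an assumption about Sales Navigator's markup and may need to be adjusted after inspecting the page, and `driver` and `dm` are the WebDriver instance and search URL from the question.

  import time

  from bs4 import BeautifulSoup
  from selenium.common.exceptions import NoSuchElementException
  from selenium.webdriver.common.by import By

  names = []
  driver.get(dm)  # `driver` and `dm` come from the question's setup code

  while True:
      time.sleep(5)
      section = driver.find_element(By.XPATH, "//*[@id='search-results-container']")

      # Scroll the results container so every row on the current page is loaded
      for _ in range(8):
          driver.execute_script(
              'arguments[0].scrollTop = arguments[0].scrollTop + arguments[0].offsetHeight;',
              section)
          time.sleep(7)

      # Parse the current page and collect the names
      soup = BeautifulSoup(driver.page_source, 'lxml')
      for name in soup.find_all('span', {'data-anonymize': 'person-name'}):
          names.append(name.text.strip())

      # Look for a next-page control; the selector below is an assumption
      # about Sales Navigator's markup and may need to be adjusted
      try:
          next_button = driver.find_element(By.CSS_SELECTOR, 'button[aria-label="Next"]')
      except NoSuchElementException:
          break  # no pagination control found -- stop
      if not next_button.is_enabled():
          break  # "Next" is disabled on the last page -- stop
      next_button.click()

After the loop exits, `names` holds the entries collected from every page the loop visited.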


huangapple
  • Posted on 2023-07-27 18:00:36
  • Please keep this link when reposting: https://go.coder-hub.com/76778598.html