Extract info using pagination - Selenium, bs4 and Python


Question

I am scraping Sales Navigator. I was able to navigate to the first page, scroll 8 times, and extract all the names and titles using Selenium and Beautiful Soup. Below is the code.

import time

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

driver.get(dm)
time.sleep(5)

section = driver.find_element(By.XPATH, "//*[@id='search-results-container']")
time.sleep(5)

counter = 0

while counter < 8:  # this will scroll 8 times
    driver.execute_script(
        'arguments[0].scrollTop = arguments[0].scrollTop + arguments[0].offsetHeight;',
        section)
    counter += 1
    # pause after each scroll so the data fully loads
    time.sleep(7)  # time is in the standard library, no install needed

src2 = driver.page_source

# Now parse with Beautiful Soup
soup = BeautifulSoup(src2, 'lxml')

name_soup = soup.find_all('span', {'data-anonymize': 'person-name'})

names = []
for name in name_soup:
    names.append(name.text.strip())
    

However, there are 8 more pages, and I need to extract all of the names from them and append them to the names list as well.

Please help.


Answer 1

Score: 0

Generally, the logic I use for pagination is:

while True:
    ## page-scraping code [i.e., your current code]

    ## search for the next-page button/link
    ### if there is a next page --> click the button or follow the link
    ### no next page --> break out of the loop

If you include the link you're trying to scrape, I might be able to give you a more specific answer. For example, this is a function I often use to scrape paginated data, although it isn't meant for scrollable pages.


huangapple
  • Posted on 2023-07-27 18:00:36
  • Please keep this link when reposting: https://go.coder-hub.com/76778598.html
  • beautifulsoup
  • pagination
  • python-3.x
  • selenium-webdriver
  • web-scraping