Could not extract data from a website using Selenium

Question

I am trying to extract data from <https://cargillsonline.com/Web/Product?IC=Mg==&NC=QmFieSBQcm9kdWN0cw==>, which contains multiple sub-pages, but each sub-page does not have a separate link to extract from.

So I use Selenium to load the website dynamically and navigate to each page. However, when I try to extract data from the second page, it only returns the first page's content.

This is the code I used to run the program.

    from bs4 import BeautifulSoup
    import requests
    import pandas as pd
    from urllib3.exceptions import InsecureRequestWarning
    from urllib3 import disable_warnings
    import time
    from pathlib import Path

    disable_warnings(InsecureRequestWarning)
    agent = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50"}

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    url = 'https://cargillsonline.com/Web/Product?IC=Mg==&NC=QmFieSBQcm9kdWN0cw=='
    path = 'C:/Users/dell/Desktop/Data/DataScraping/chrome_driver/chromedriver'
    service = Service(path)
    driver = webdriver.Chrome(service=service)
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')

    def get_data():
        start = time.process_time()
        url = main_url
        product_name = []
        product_price = []
        count = 0
        all_pages = 10  # this number is only for testing purposes
        print('Get Data Processing .....')
        for i in range(all_pages):
            if count == 0:
                add_boxs_v1 = soup.find_all(class_='veg')
                for product in add_boxs_v1:
                    product_name.append(product.find('p').text)
                add_boxs_v2 = soup.find_all(class_='strike1')
                for price in add_boxs_v2:
                    product_price.append(price.find('h4').text)
                count += 1
            WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[@ng-click='selectPage(page + 1, $event)']"))).click()
            time.sleep(5)
        print('done')
        df = pd.DataFrame({'Product_name': product_name, 'Price': product_price})
        return df

    df = get_data()
    df.head()

Could someone please point out which step in this process went wrong?


Answer 1

Score: 2

You are only getting the first page because your page_source contains only the first page. After every click operation you need to capture the current page_source.

You need to move the page_source call inside the for loop so that you get the latest page_source every time.

    url = 'https://cargillsonline.com/Web/Product?IC=Mg==&NC=QmFieSBQcm9kdWN0cw=='
    path = 'C:/Users/dell/Desktop/Data/DataScraping/chrome_driver/chromedriver'
    service = Service(path)
    driver = webdriver.Chrome(service=service)
    driver.get(url)

    def get_data():
        start = time.process_time()
        product_name = []
        product_price = []
        all_pages = 10  # this number is only for testing purposes
        print('Get Data Processing .....')
        for i in range(all_pages):
            # Re-read the page source on every iteration so the products of the
            # page currently displayed are parsed, not just the first page.
            html = driver.page_source
            soup = BeautifulSoup(html, 'html.parser')
            add_boxs_v1 = soup.find_all(class_='veg')
            for product in add_boxs_v1:
                product_name.append(product.find('p').text)
            add_boxs_v2 = soup.find_all(class_='strike1')
            for price in add_boxs_v2:
                product_price.append(price.find('h4').text)
            # Click the "next page" link, then give the page time to render.
            WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[@ng-click='selectPage(page + 1, $event)']"))).click()
            time.sleep(5)
        print('done')
        df = pd.DataFrame({'Product_name': product_name, 'Price': product_price})
        return df

    df = get_data()
    df.head()
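
If the fixed time.sleep(5) ever proves unreliable, the wait can be made explicit. The sketch below is only an illustration layered on the code above: it assumes the first product shown (the text of the p tag inside a .veg element, the same selectors the question uses) changes when the next page loads; if two consecutive pages started with the same product name it would time out.

    from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    def click_next_and_wait(driver, timeout=20):
        """Click the Angular 'next page' link and wait until new products appear."""
        # Remember the first product name that is visible right now.
        old_first = driver.find_element(By.CSS_SELECTOR, '.veg p').text

        # Same pagination link as in the code above.
        WebDriverWait(driver, timeout).until(
            EC.element_to_be_clickable(
                (By.XPATH, "//a[@ng-click='selectPage(page + 1, $event)']")
            )
        ).click()

        # Poll until the first product name changes, i.e. the next page has rendered.
        WebDriverWait(
            driver, timeout,
            ignored_exceptions=(NoSuchElementException, StaleElementReferenceException),
        ).until(
            lambda d: d.find_element(By.CSS_SELECTOR, '.veg p').text != old_first
        )

Inside the loop above, a call to click_next_and_wait(driver) would then replace the WebDriverWait(...).click() and time.sleep(5) pair.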

Answer 2

Score: 1

Below is simple code to move to page 2 from your URL. The other pages follow the same CSS pattern, so you don't need to worry about them. I tested it and it works. You just need to integrate the data-extraction part from your code.

    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager
    from selenium.webdriver.common.by import By
    import time

    # create a new Chrome browser instance
    browser = webdriver.Chrome(ChromeDriverManager().install())

    # navigate to the website
    browser.get("https://cargillsonline.com/Web/Product?IC=Mg==&NC=QmFieSBQcm9kdWN0cw==")
    time.sleep(10)

    # CSS path of the page 2 button.
    # For each further page, increase the number inside li:nth-child(4),
    # e.g. li:nth-child(5), li:nth-child(6)
    css_path = """
    #divProducts > div.divPagingProd > ul > li:nth-child(4) > a
    """

    # Find the button, scroll down to it, then click
    button = browser.find_element(By.CSS_SELECTOR, css_path)
    browser.execute_script("arguments[0].scrollIntoView();", button)
    browser.execute_script("arguments[0].click();", button)
    time.sleep(10)

    browser.quit()
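
As a rough sketch, the nth-child comment above can be turned into a loop that steps through several pages (run before the final browser.quit(), reusing the browser, By, and time names from the snippet). The page count of five here is only a placeholder, and it assumes the pagination list keeps the same li positions as pages are clicked, which would need to be checked on the real site.

    # Step through a few pages by increasing the nth-child index:
    # page 2 sits at li:nth-child(4), page 3 at li:nth-child(5), and so on.
    for n in range(4, 9):
        css_path = f"#divProducts > div.divPagingProd > ul > li:nth-child({n}) > a"
        button = browser.find_element(By.CSS_SELECTOR, css_path)
        browser.execute_script("arguments[0].scrollIntoView();", button)
        browser.execute_script("arguments[0].click();", button)
        time.sleep(10)
        # ...parse browser.page_source here, e.g. with BeautifulSoup as in the question...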

Hope this helps.

