How to improve webscraping speed using Selenium?


Question


I have made a python script for webscraping product information from Target using Selenium. Since there are a lot of products, I have added a loop to iterate the process. Unfortunately, this process is too slow, and I was wondering if there are ways to improve the efficiency, e.g. by using wait times.

For each iteration the information of the products is appended to a list 'lstTitles'. Below I have attached part of the code:

# imports used below (selenium plus the webdriver-manager package)
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import time

options = Options()
options.add_argument('--headless=new')
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
# loop over the DPCI codes from the Excel dataset ('files' is defined earlier in the script)
lstReview = []
name='Target' #for now
x_lst=[]
for i in files:
    dpci = i
    print(dpci)
    # variables
    lstTitles = []
    OverallRating = ""

    # navigate to the target page
    driver.get('https://www.target.com/s?searchTerm=' + dpci)
    time.sleep(5)
    glink = driver.find_elements(by=By.XPATH, value='//*[@id="pageBodyContainer"]/div[1]/div/div[4]/div/div/div[2]/div/section/div/div/div/div/div[1]/div[2]/div/div/div[1]/div[1]/div[1]/a')
    print(type(glink))
    # open the first product link, if any
    flag = 0
    for link in glink:
        print(link.get_attribute('href'))
        link.click()
        flag = 1
        break
    if flag == 0:
        continue  # if no link was found, go to the next dpci


    driver.execute_script("window.scrollTo(0, 1000)")
    time.sleep(3)
    driver.execute_script("window.scrollTo(1000, 2000)")
    time.sleep(3)
    driver.execute_script("window.scrollTo(3000, 4000)")
    time.sleep(3)
    driver.execute_script("window.scrollTo(4000, 5000)")
    #input("TEST")
    loadMoreFound = "Yes"
    #dtlink = driver.find_elements(by=By.CSS_SELECTOR('.h-text-center.h-padding-v-tight')[0])
    # Click on Load More Button
    while loadMoreFound == "Yes":
        loadMoreFound = "No"
        try:
            print("load more button")
            clink=driver.find_elements(by=By.CSS_SELECTOR, value='.h-text-center .h-padding-v-tight')
            #print(len(clink))
            #print(type(clink))
            #print(clink)

            for loadmore in clink:
#                 print("in the for")
                t = loadmore.get_attribute('innerHTML')
#                 print(t)
                #input("load more")
                if "Load" in t:
#                     print("Load more button found")
                    loadMoreFound = "Yes"
                    loadmore.click()
                    time.sleep(3)
                    break
        except:
            print("Error1")
            loadMoreFound = "No"
    counter = 0

    # collect the review titles
    time.sleep(3)
    ReviewTitle = driver.find_elements(by=By.CLASS_NAME, value='jfrTHg')
    for title in ReviewTitle:
        lstTitles.append(title.get_attribute('innerText'))
    print("Moving to next DPCI")

I have added time.sleep wherever a new element was to be located or clicked. Apart from that, my intuition about wait times is not very solid, so I am not sure whether to add an explicit wait or an implicit wait to my code. Any feedback would be highly appreciated!

Answer 1

Score: 2

  1. time.sleep() is not the best way to handle waiting in Selenium. Instead, use Selenium's built-in WebDriverWait class.

Example:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait up to 10 seconds before throwing a TimeoutException unless the element is found
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'element_id')))

  2. When you know there's only one of a certain element on the page, use find_element instead of find_elements (e.g., when getting 'glink'). This is slightly more efficient because Selenium stops looking as soon as it finds the first match (see the sketch below).

  3. Your current scrolling pattern involves fixed distances and wait times. Instead, try scrolling until a certain element is found or until you can't scroll any further (see the scrolling sketch below).

  4. Instead of continuously checking for the 'Load more' button, you can set up a loop that stops when the button isn't found or can't be clicked (see the load-more sketch below).

  5. If you're scraping multiple pages independently, you can make use of multiprocessing in Python to scrape multiple pages at the same time (see the multiprocessing sketch below).
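
A minimal sketch of the find_element idea in point 2, assuming 'driver' is the Chrome driver from the question; the shortened XPath is only illustrative, and the asker's full 'glink' XPath would go in its place:

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

try:
    # find_element returns the first match or raises NoSuchElementException,
    # so there is no need to collect every match into a list first
    glink = driver.find_element(By.XPATH, '//*[@id="pageBodyContainer"]//a')  # illustrative XPath
    print(glink.get_attribute('href'))
    glink.click()
except NoSuchElementException:
    pass  # no product link for this DPCI; move on to the next one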
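
For point 3, one common pattern is to scroll to the bottom repeatedly and stop once the page height stops growing. A sketch, again assuming 'driver' is the driver from the question (the 1-second pause is an arbitrary settle time):

import time

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(1)  # give lazy-loaded content a moment to render
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # page height stopped growing: nothing left to load
    last_height = new_height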
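
Point 4 can be written as a loop that exits on TimeoutException instead of polling with fixed sleeps. A sketch reusing the CSS selector from the question; the 5-second timeout is an arbitrary choice:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 5)  # wait at most 5 seconds per attempt
while True:
    try:
        # wait until the button is present and clickable, instead of sleeping
        load_more = wait.until(EC.element_to_be_clickable(
            (By.CSS_SELECTOR, '.h-text-center .h-padding-v-tight')))
    except TimeoutException:
        break  # button is gone: all reviews are loaded
    if "Load" not in load_more.get_attribute('innerHTML'):
        break  # the element exists but is no longer the 'Load more' button
    load_more.click()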
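
And for point 5, each worker process must create its own browser, since a WebDriver session can't be shared across processes. A rough sketch in which scrape_dpci is a hypothetical wrapper around the per-DPCI logic above and 'files' is the DPCI list from the question:

from multiprocessing import Pool

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

def scrape_dpci(dpci):
    # each process gets its own headless browser
    options = Options()
    options.add_argument('--headless=new')
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                              options=options)
    try:
        driver.get('https://www.target.com/s?searchTerm=' + dpci)
        titles = []
        # ... open the product link, expand the reviews, fill 'titles' as in the question ...
        return titles
    finally:
        driver.quit()

if __name__ == '__main__':
    with Pool(processes=4) as pool:  # 4 parallel browsers is an arbitrary choice
        results = pool.map(scrape_dpci, files)

Keep the pool small: each headless Chrome instance uses a significant amount of memory.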

Answer 2

To improve efficiency, consider replacing time.sleep with explicit waits. An explicit wait pauses only until a specific condition is met, rather than for a fixed, hard-coded amount of time, which makes the code more flexible and efficient.

Here is how explicit waits could be used in your code:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# ...

# a wait that polls for up to 5 seconds before timing out
wait = WebDriverWait(driver, 5)

# navigate to the target page
driver.get('https://www.target.com/s?searchTerm=' + dpci)

# wait explicitly for the product links to appear, then click the first one
try:
    glink = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="pageBodyContainer"]/div[1]/div/div[4]/div/div/div[2]/div/section/div/div/div/div/div[1]/div[2]/div/div/div[1]/div[1]/div[1]/a')))
    print(type(glink))
    # click the first link
    if glink:
        glink[0].click()
except Exception as e:
    print("Error:", e)
    continue  # this snippet sits inside the outer 'for i in files' loop: move on to the next dpci

# ...

# use an explicit wait wherever the script previously slept
try:
    load_more_button = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.h-text-center .h-padding-v-tight')))
    if "Load" in load_more_button.get_attribute('innerHTML'):
        load_more_button.click()
        time.sleep(3)
except Exception as e:
    print("Error:", e)

# ...

By using explicit waits you wait precisely for elements to appear instead of sleeping for a fixed amount of time, which makes the code more stable and efficient.