How to improve webscraping speed using Selenium?
Question
To improve efficiency, consider using an explicit wait instead of time.sleep. An explicit wait pauses only until a specific condition is met, rather than for a hard-coded fixed interval, which makes your code more flexible and efficient.
Here is how explicit waits could be used in your code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# ...
# Wait up to 5 seconds for the conditions below
wait = WebDriverWait(driver, 5)

# Navigate to the target page
driver.get('https://www.target.com/s?searchTerm=' + dpci)

# Explicitly wait for the elements to appear, then click the first one
try:
    glink = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="pageBodyContainer"]/div[1]/div/div[4]/div/div/div[2]/div/section/div/div/div/div/div[1]/div[2]/div/div/div[1]/div[1]/div[1]/a')))
    print(type(glink))
    # Click the first link
    if glink:
        glink[0].click()
except Exception as e:
    print("Error:", e)
    continue
# ...
# Use an explicit wait wherever the code has to wait for something
try:
    load_more_button = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.h-text-center .h-padding-v-tight')))
    if "Load" in load_more_button.get_attribute('innerHTML'):
        load_more_button.click()
        time.sleep(3)
except Exception as e:
    print("Error:", e)
# ...
By using explicit waits, you wait precisely until an element appears rather than for a fixed amount of time, which makes your code more stable and efficient.
I have made a Python script for webscraping product information from Target using Selenium. Since there are a lot of products, I have added a loop to iterate the process. Unfortunately this process is too slow, and I was wondering if there are ways to improve the efficiency, e.g. using wait times etc.
For each iteration the information of the products is appended to a list 'lstTitles'. Below I have attached part of the code:
options = Options()
options.add_argument('--headless=new')
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

# loop on excel dataset
lstReview = []
name = 'Target'  # for now
x_lst = []
for i in files:
    dpci = i
    print(dpci)
    # variables
    lstTitles = []
    OverallRating = ""
    # navigate to the target page
    driver.get('https://www.target.com/s?searchTerm=' + dpci)
    time.sleep(5)
    glink = driver.find_elements(by=By.XPATH, value='//*[@id="pageBodyContainer"]/div[1]/div/div[4]/div/div/div[2]/div/section/div/div/div/div/div[1]/div[2]/div/div/div[1]/div[1]/div[1]/a')
    print(type(glink))
    # first link open
    flag = 0
    for i in glink:
        print(i.get_attribute('href'))
        i.click()
        flag = 1
        break
    if flag == 0:
        continue  # if link not found then go to next dpci
    driver.execute_script("window.scrollTo(0, 1000)")
    time.sleep(3)
    driver.execute_script("window.scrollTo(1000, 2000)")
    time.sleep(3)
    driver.execute_script("window.scrollTo(3000, 4000)")
    time.sleep(3)
    driver.execute_script("window.scrollTo(4000, 5000)")
    loadMoreFound = "Yes"
    # Click on Load More Button
    while loadMoreFound == "Yes":
        loadMoreFound = "No"
        try:
            print("load more button")
            clink = driver.find_elements(by=By.CSS_SELECTOR, value='.h-text-center .h-padding-v-tight')
            for loadmore in clink:
                t = loadmore.get_attribute('innerHTML')
                if "Load" in t:
                    loadMoreFound = "Yes"
                    loadmore.click()
                    time.sleep(3)
                    break
        except:
            print("Error1")
            loadMoreFound = "No"
    counter = 0
    # Title
    time.sleep(3)
    ReviewTitle = driver.find_elements(by=By.CLASS_NAME, value='jfrTHg')
    # Get Reviews
    for i in ReviewTitle:
        t = i.get_attribute('innerText')
        lstTitles.append(t)
    print("Moving to next DPCI")
I have added time.sleep wherever a new element was to be located or clicked. Apart from that, my intuition behind wait times is not very solid, so I am not sure whether to add an explicit wait or an implicit wait to my code. Any feedback would be highly appreciated!
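For intuition, an explicit wait is essentially just a poll loop. Below is a plain-Python sketch (the name `wait_until` is illustrative, not Selenium's API) of roughly what `WebDriverWait(driver, timeout).until(condition)` does under the hood; the real class additionally retries through a configurable list of ignored exceptions:

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value,
    or raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value  # same contract as WebDriverWait.until: return the value
        if time.monotonic() > deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)  # poll interval, like WebDriverWait's poll_frequency
```

This is why an explicit wait returns as soon as the condition holds, while `time.sleep(5)` always burns the full five seconds.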
Answer 1
Score: 2
- time.sleep() is not the best way to handle waiting in Selenium. Instead, use Selenium's built-in WebDriverWait class. Example:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait up to 10 seconds before throwing a TimeoutException unless the element is found
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'element_id')))

- When you know there's only one of a certain element on the page, use find_element instead of find_elements (e.g., when getting 'glink'). This is slightly more efficient because Selenium stops looking as soon as it finds the first match.
- Your current scrolling pattern involves fixed distances and wait times. Instead, try scrolling until a certain element is found or until you can't scroll any further.
- Instead of continuously checking for the 'Load more' button, you can set up a loop that stops when the button isn't found or can't be clicked.
- If you're scraping multiple pages independently, you can make use of multiprocessing in Python to scrape multiple pages at the same time.
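The scrolling suggestion above can be sketched as a helper that keeps scrolling until `document.body.scrollHeight` stops growing. This is a sketch under assumptions: `scroll_to_bottom`, `pause`, and `max_rounds` are illustrative names, and the fixed pause could itself be replaced by an explicit wait for new content:

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll the window down until the page height stops growing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give lazy-loaded content a moment to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new loaded: we have reached the bottom
        last_height = new_height
```

This would replace the four hard-coded `window.scrollTo` calls and their `time.sleep(3)` pauses in the question.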
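The last suggestion can be sketched with `multiprocessing.Pool`. Note that a WebDriver instance cannot be shared across processes, so each worker must create (and quit) its own driver; `scrape_one` below is a hypothetical placeholder for that per-DPCI routine:

```python
from multiprocessing import Pool

def scrape_one(dpci):
    # Hypothetical worker: in real use, start a fresh headless driver here,
    # load 'https://www.target.com/s?searchTerm=' + dpci, collect the review
    # titles, quit the driver, and return them.
    return dpci, ["placeholder title for " + dpci]

if __name__ == "__main__":
    files = ["123-45-6789", "987-65-4321"]  # example DPCIs
    with Pool(processes=2) as pool:  # one process per concurrently scraped page
        results = dict(pool.map(scrape_one, files))
```

Whether this pays off depends on how many browser instances the machine can run at once; each headless Chrome process is fairly memory-hungry.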