How to improve webscraping speed using Selenium?


Question


I have made a Python script for webscraping product information from Target using Selenium. Since there are a lot of products, I have added a loop to iterate the process. Unfortunately this process is too slow, and I was wondering if there are ways to improve its efficiency, e.g. by using wait times.

For each iteration the information of the products is appended to a list 'lstTitles'. Below I have attached part of the code:

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument('--headless=new')
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
# loop on excel dataset
lstReview = []
name='Target' #for now
x_lst=[]
for i in files:  # each i is a DPCI product code from the Excel dataset
    dpci = i
    print(dpci)
    # variables
    lstTitles = []
    OverallRating = ""

    # navigate to the target page
    driver.get('https://www.target.com/s?searchTerm=' + dpci)
    time.sleep(5)
    glink = driver.find_elements(by=By.XPATH, value='//*[@id="pageBodyContainer"]/div[1]/div/div[4]/div/div/div[2]/div/section/div/div/div/div/div[1]/div[2]/div/div/div[1]/div[1]/div[1]/a')
    print(type(glink))
    # first link open
    flag=0
    for i in glink:
        print(i.get_attribute('href'))
        i.click()
        flag=1
        break
    if(flag==0):
        continue #if link not found then go to next dpci


    driver.execute_script("window.scrollTo(0, 1000)")
    time.sleep(3)
    driver.execute_script("window.scrollTo(1000, 2000)")
    time.sleep(3)
    driver.execute_script("window.scrollTo(3000, 4000)")
    time.sleep(3)
    driver.execute_script("window.scrollTo(4000, 5000)")
    #input("TEST")
    loadMoreFound = "Yes"
    #dtlink = driver.find_elements(by=By.CSS_SELECTOR('.h-text-center.h-padding-v-tight')[0])
    # Click on Load More Button
    while loadMoreFound == "Yes":
        loadMoreFound = "No"
        try:
            print("load more button")
            clink=driver.find_elements(by=By.CSS_SELECTOR, value='.h-text-center .h-padding-v-tight')
            #print(len(clink))
            #print(type(clink))
            #print(clink)

            for loadmore in clink:
#                 print("in the for")
                t = loadmore.get_attribute('innerHTML')
#                 print(t)
                #input("load more")
                if "Load" in t:
#                     print("Load more button found")
                    loadMoreFound = "Yes"
                    loadmore.click()
                    time.sleep(3)
                    break
        except:
            print("Error1")
            loadMoreFound = "No"
    counter = 0
    
    # Title
    time.sleep(3)
    ReviewTitle=driver.find_elements(by=By.CLASS_NAME, value='jfrTHg')
    #print(str(len(ReviewTitle)))
    # Get Reviews
    for i in ReviewTitle:
        t = i.get_attribute('innerText')
        #print(t)
        lstTitles.append(t)
        #input("get")
    print("Moving to next DPCI")

I have added time.sleep wherever a new element was to be located or clicked. Apart from that, my intuition behind wait times is not very solid, so I am not sure whether to add an explicit wait or an implicit wait to my code. Any feedback would be highly appreciated!

Answer 1

Score: 2

  1. time.sleep() is not the best way to handle waiting in Selenium. Instead, use Selenium's built-in WebDriverWait class.

Example:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait up to 10 seconds before throwing a TimeoutException unless the element is found
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'element_id')))

Applied to the code in the question, the explicit wait replaces the fixed sleeps. It waits precisely for a specific condition to be met instead of pausing for a hard-coded number of seconds, which makes the code more flexible and efficient:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# ...

# Wait up to 5 seconds for each condition
wait = WebDriverWait(driver, 5)

# Navigate to the target page
driver.get('https://www.target.com/s?searchTerm=' + dpci)

# Wait for the product links to appear, then click the first one
try:
    glink = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="pageBodyContainer"]/div[1]/div/div[4]/div/div/div[2]/div/section/div/div/div/div/div[1]/div[2]/div/div/div[1]/div[1]/div[1]/a')))
    if glink:
        glink[0].click()
except Exception as e:
    print("Error:", e)
    continue  # inside the loop over DPCIs: skip to the next one

# ...

# Wait for the 'Load More' button instead of sleeping before looking for it
try:
    load_more_button = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.h-text-center .h-padding-v-tight')))
    if "Load" in load_more_button.get_attribute('innerHTML'):
        load_more_button.click()
except Exception as e:
    print("Error:", e)

# ...

By waiting for elements to appear instead of waiting a fixed amount of time, the code becomes both more stable and more efficient.
  2. When you know there's only one of a certain element on the page, use find_element(...) instead of find_elements(...) (e.g., when getting 'glink'). This is slightly more efficient because Selenium stops looking as soon as it finds the first match; see the sketch below.
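
A minimal sketch of the difference. The XPath here is a shortened, hypothetical product-link locator, not the one from the question:

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

# find_elements scans the whole page and returns a list (empty if nothing matches)
links = driver.find_elements(By.XPATH, '//a[@data-test="product-link"]')  # hypothetical locator

# find_element returns the first match, or raises if there is none
try:
    driver.find_element(By.XPATH, '//a[@data-test="product-link"]').click()
except NoSuchElementException:
    pass  # no product link for this DPCI; move on to the next one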

  3. Your current scrolling pattern involves fixed distances and wait times. Instead, try scrolling until a certain element is found or until you can't scroll any further; see the sketch below.
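
One common pattern is to scroll in steps until document.body.scrollHeight stops growing; a minimal sketch, where the 0.5-second pause and 30-step cap are arbitrary safety choices:

import time

def scroll_to_bottom(driver, pause=0.5, max_steps=30):
    # Scroll down until the page height stops growing (no more lazy-loaded content)
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_steps):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # brief pause so newly loaded content can render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # height unchanged: we have reached the bottom
        last_height = new_height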

  4. Instead of continuously checking for the 'Load more' button, you can set up a loop that stops when the button isn't found or can't be clicked; see the sketch below.
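
For example, a short per-attempt timeout turns a missing or unclickable button into the loop's exit condition. This sketch reuses the '.h-text-center .h-padding-v-tight' selector from the question:

from selenium.common.exceptions import ElementClickInterceptedException, TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

while True:
    try:
        # Wait at most 3 seconds for a clickable 'Load more' button
        button = WebDriverWait(driver, 3).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, '.h-text-center .h-padding-v-tight'))
        )
        if "Load" not in button.get_attribute('innerHTML'):
            break  # the matched element is not the 'Load more' button
        button.click()
    except (TimeoutException, ElementClickInterceptedException):
        break  # button gone or not clickable: all reviews are loaded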

  5. If you're scraping multiple pages independently, you can make use of multiprocessing in Python to scrape multiple pages at the same time; see the sketch below.
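
One way to structure this is with a multiprocessing.Pool where each worker creates its own browser, since a WebDriver instance cannot be shared across processes. Here scrape_dpci is a hypothetical wrapper around the per-DPCI logic from the question:

from multiprocessing import Pool

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

def scrape_dpci(dpci):
    # Each process gets its own headless browser
    options = Options()
    options.add_argument('--headless=new')
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    try:
        driver.get('https://www.target.com/s?searchTerm=' + dpci)
        # ... open the product page, load all reviews, collect titles (as in the question) ...
        return []  # list of review titles for this DPCI
    finally:
        driver.quit()

if __name__ == '__main__':
    with Pool(processes=4) as pool:  # four browsers scraping in parallel
        all_titles = pool.map(scrape_dpci, files)  # 'files' as in the question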

huangapple
  • Published on 2023-05-29 12:53:31
  • When reposting, please keep this link: https://go.coder-hub.com/76354760.html