How to scrape a website that has dynamic content in multiple pages or categories using python

Question

I'm learning web scraping with Python and as a learning project I'm trying to extract all the products and their prices from a supermarket website.

This supermarket has more than 100 categories of products. This is the page of one category:

https://www.vea.com.ar/electro/aire-acondicionado-y-ventilacion

As you can see, some products have discount prices that are not present when the page first loads; they are loaded dynamically afterwards.

I could handle that by using Selenium and a WebDriver with a waiting time of a couple of seconds, like this:

import requests
import json
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import time

def getHtmlDynamic(url, time_wait):
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(time_wait)  # give the page time to render the dynamically loaded prices
    soup = BeautifulSoup(driver.page_source, 'html5lib')
    driver.quit()

    return soup

def getProductsAndPrices(html):
    # Product metadata is embedded in the page as JSON-LD; the second
    # script tag holds the item list for the category
    prodsJson = html.find_all('script', {'type': 'application/ld+json'})
    dfProds = pd.json_normalize(json.loads(prodsJson[1].contents[0])['itemListElement'])

    # Prices (including the dynamically loaded discounts) live in separate divs
    pricesList = html.find_all('div', {'class': 'contenedor-precio'})
    prices = []

    for row in pricesList:
        for price in row.find_all('span'):
            prices.append(price.text)

    # Keep only as many prices as there are products
    dfProds['price'] = prices[:dfProds.shape[0]]

    return dfProds

htmlProducts = getHtmlDynamic(url='https://www.vea.com.ar/electro/aire-acondicionado-y-ventilacion', time_wait=20)
    
dfProds = getProductsAndPrices(htmlProducts)

This works well for one specific category, but when I tried to scale it to more categories (10, for example) with a for loop, it crashes: the dynamic content is not correctly loaded after the second iteration.

dfProductsConsolidated = pd.DataFrame([])

# dfCategories holds the category URLs (built elsewhere)
for category in dfCategories['categoryURL'][:10]:
    htmlProducts = getHtmlDynamic(url=category, time_wait=20)

    dfProds = getProductsAndPrices(htmlProducts)

    # DataFrame.append is deprecated; pd.concat is the supported way to accumulate
    dfProductsConsolidated = pd.concat([dfProductsConsolidated, dfProds], ignore_index=True)

Is there any way to handle this kind of scraping at a large scale? Are there any best practices that can help me with this?

Thanks in advance!

Answer 1

Score: 1

To speed up page loading, I suggest starting the driver in headless mode with images disabled.

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
options.add_argument('--blink-settings=imagesEnabled=false')
driver = webdriver.Chrome(options=options)
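
Headless mode and the imagesEnabled=false flag should noticeably cut page-load time, since product thumbnails are usually the heaviest assets on category pages like these, while the JSON-LD data and the price elements should still load normally without them.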

The following code scrapes data for all the products in the 10 categories. It clicks the "Mostrar más" (show more) button whenever it is present, so that all the products get loaded. The execution took about 14 minutes on my computer, and it did not crash. It was slow mainly because the category "Almacen/Desayuno-y-Merienda" alone contains about 800 products.

Data (items and prices) are stored in a dictionary, and each category has a separate dictionary. All the dictionaries are stored in another dictionary called data.
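
As an illustration, the final structure looks like this (the item names and prices here are hypothetical):

# Hypothetical example of the structure built by the loop below
data = {
    'Electro/aire-acondicionado-y-ventilacion': {
        'item': ['Ventilador de Pie 16"', '...'],
        'price': ['25999', '...'],
    },
    # ... one sub-dictionary per category
}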

# By, WebDriverWait and EC are needed for the explicit waits below
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementClickInterceptedException, StaleElementReferenceException

urls = '''https://www.vea.com.ar/Electro/aire-acondicionado-y-ventilacion
https://www.vea.com.ar/Almacen/Aceites-y-Vinagres
https://www.vea.com.ar/Almacen/Desayuno-y-Merienda
https://www.vea.com.ar/Lacteos/Leches
https://www.vea.com.ar/Frutas-y-Verduras/Frutas
https://www.vea.com.ar/Bebes-y-Ninos/Jugueteria
https://www.vea.com.ar/Quesos-y-Fiambres/Fiambres
https://www.vea.com.ar/Panaderia-y-Reposteria/Panaderia
https://www.vea.com.ar/Mascotas/Perros
https://www.vea.com.ar/Bebidas/Gaseosas'''.split('\n')

# Strip the domain to keep only the category path
categories = [url.replace('https://www.vea.com.ar/', '') for url in urls]
data = {key: {} for key in categories}

for idx, category in enumerate(categories):
    info = f'[{idx + 1}/{len(categories)}] {category} '
    print(info, end='')
    driver.get('https://www.vea.com.ar/' + category)

    # Wait until the footer reports the loaded and total product counts
    number_of_products = 0
    while number_of_products == 0:
        footer = WebDriverWait(driver, 20).until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, 'p.text-content')))
        number_of_products = int(footer.text.split()[3])
    number_of_loaded_products = int(footer.text.split()[1])
    print(f'(loaded products={number_of_loaded_products}, total={number_of_products})', end='\r')

    # Keep clicking "Mostrar más" until every product is loaded
    while number_of_loaded_products < number_of_products:
        footer = driver.find_element(By.CSS_SELECTOR, 'p.text-content')
        driver.execute_script('arguments[0].scrollIntoView({block: "center"});', footer)
        show_more = driver.find_elements(By.XPATH, "//div[text()='Mostrar más']")
        if show_more:
            try:
                show_more[0].click()
            except (ElementClickInterceptedException, StaleElementReferenceException):
                continue
        number_of_loaded_products = int(footer.text.split()[1])
        print(info + f'(loaded products={number_of_loaded_products}, total={number_of_products})', end='\r')
        time.sleep(1)

    # With all products loaded, read names and prices from the JSON-LD script
    loaded_products = json.loads(driver.find_element(
        By.CSS_SELECTOR, "body script[type='application/ld+json']").get_attribute('innerText'))['itemListElement']
    products = {'item': [], 'price': []}
    for prod in loaded_products:
        products['item'] += [prod['item']['name']]
        products['price'] += [prod['item']['offers']['offers'][0]['price']]
    data[category] = products
    print()

The code prints progress info while looping, and in the end you have something like this:

[1/10] Electro/aire-acondicionado-y-ventilacion (loaded products=7, total=7)
[2/10] Almacen/Aceites-y-Vinagres (loaded products=87, total=87)
[3/10] Almacen/Desayuno-y-Merienda (loaded products=808, total=808)
[4/10] Lacteos/Leches (loaded products=80, total=80)
[5/10] Frutas-y-Verduras/Frutas (loaded products=70, total=70)
[6/10] Bebes-y-Ninos/Jugueteria (loaded products=57, total=57)
[7/10] Quesos-y-Fiambres/Fiambres (loaded products=19, total=19)
[8/10] Panaderia-y-Reposteria/Panaderia (loaded products=17, total=17)
[9/10] Mascotas/Perros (loaded products=66, total=66)
[10/10] Bebidas/Gaseosas (loaded products=64, total=64)

To visualize the scraped data you can run pd.DataFrame(data[categories[idx]]), where idx is an integer from 0 to len(categories)-1. For example, idx=1 gives a table of the items and prices in Almacen/Aceites-y-Vinagres.
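
If you want a single consolidated table like the one in the question, a minimal sketch (assuming the data dictionary built above) is:

import pandas as pd

# Combine every per-category dictionary into one table,
# keeping the category as an extra column
frames = []
for category, products in data.items():
    df = pd.DataFrame(products)
    df['category'] = category
    frames.append(df)

dfAllProducts = pd.concat(frames, ignore_index=True)
print(dfAllProducts.head())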
