How to scrape a website that has dynamic content in multiple pages or categories using Python

Question

I'm learning web scraping with Python and as a learning project I'm trying to extract all the products and their prices from a supermarket website.

This supermarket has more than 100 categories of products. This is the page of one category:

https://www.vea.com.ar/electro/aire-acondicionado-y-ventilacion

As you can see, some products have discount prices that are not present when the page first loads; they are loaded dynamically afterwards.

I could handle that by using Selenium and a WebDriver with a waiting time of a couple of seconds, like this:

import json
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import time

def getHtmlDynamic(url, time_wait):
    # Render the page in a real browser and wait for the dynamic content to load
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(time_wait)
    soup = BeautifulSoup(driver.page_source, 'html5lib')
    driver.quit()

    return soup

def getProductsAndPrices(html):
    # The product list is embedded as JSON-LD in the page's <script> tags
    prodsJson = html.find_all('script', {'type': 'application/ld+json'})
    dfProds = pd.json_normalize(json.loads(prodsJson[1].contents[0])['itemListElement'])

    # Prices are rendered separately inside 'contenedor-precio' divs
    pricesList = html.find_all('div', {'class': 'contenedor-precio'})
    prices = []

    for row in pricesList:
        price_row = row.find_all('span')
        for price in price_row:
            priceFinal = price.text
            prices.append(priceFinal)

    # Keep as many prices as there are products
    pricesFinalList = prices[:dfProds.shape[0]]

    dfProds['price'] = pricesFinalList

    return dfProds

htmlProducts = getHtmlDynamic(url='https://www.vea.com.ar/electro/aire-acondicionado-y-ventilacion', time_wait=20)
    
dfProds = getProductsAndPrices(htmlProducts)

This works well for one specific category, but when I tried to scale it to more categories (10, for example) with a for loop, it crashes: the dynamic content is no longer loaded correctly after the second iteration.

dfProductsConsolidated = pd.DataFrame([])

for category in dfCategories['categoryURL'][:10]:
    htmlProducts = getHtmlDynamic(url=category, time_wait=20)

    dfProds = getProductsAndPrices(htmlProducts)

    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    dfProductsConsolidated = pd.concat([dfProductsConsolidated, dfProds])

Is there any way to handle this kind of scraping at a larger scale? Are there any best practices that could help me with this?

Thanks in advance!

Answer 1

Score: 1

To speed up page loading, I suggest starting the driver in headless mode with images disabled.

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
options.add_argument('--blink-settings=imagesEnabled=false')
driver = webdriver.Chrome(options=options)
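
If page loads are still slow, Selenium 4 also offers an 'eager' page load strategy, which makes driver.get() return at DOMContentLoaded instead of waiting for the full load event. A sketch (set it before creating the driver; untested against this site, and the lazy-loaded prices may still need the explicit waits below):

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
options.add_argument('--blink-settings=imagesEnabled=false')
options.page_load_strategy = 'eager'  # return from driver.get() at DOMContentLoaded
driver = webdriver.Chrome(options=options)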

The following code scrapes data for all the products in the 10 categories. It clicks the "Mostrar más" (show more) button whenever it is present, so that all the products are loaded. The execution took about 14 minutes on my computer, and it did not crash. It is slow mainly because the category "Almacen/Desayuno-y-Merienda" contains over 800 products.

Data (items and prices) are stored in a dictionary, with a separate dictionary per category. All of these dictionaries are stored in another dictionary called data.

import json
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementClickInterceptedException, StaleElementReferenceException

urls = '''https://www.vea.com.ar/Electro/aire-acondicionado-y-ventilacion
https://www.vea.com.ar/Almacen/Aceites-y-Vinagres
https://www.vea.com.ar/Almacen/Desayuno-y-Merienda
https://www.vea.com.ar/Lacteos/Leches
https://www.vea.com.ar/Frutas-y-Verduras/Frutas
https://www.vea.com.ar/Bebes-y-Ninos/Jugueteria
https://www.vea.com.ar/Quesos-y-Fiambres/Fiambres
https://www.vea.com.ar/Panaderia-y-Reposteria/Panaderia
https://www.vea.com.ar/Mascotas/Perros
https://www.vea.com.ar/Bebidas/Gaseosas'''.split('\n')

# Category paths are the URLs with the domain stripped
categories = [url.replace('https://www.vea.com.ar/', '') for url in urls]
data = {key: {} for key in categories}

for idx, category in enumerate(categories):
    info = f'[{idx + 1}/{len(categories)}] {category} '
    print(info, end='')
    driver.get('https://www.vea.com.ar/' + category)

    # The footer reports the loaded product count (word 1) and the total (word 3)
    number_of_products = 0
    while number_of_products == 0:
        footer = WebDriverWait(driver, 20).until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, 'p.text-content')))
        number_of_products = int(footer.text.split()[3])
    number_of_loaded_products = int(footer.text.split()[1])
    print(f'(loaded products={number_of_loaded_products}, total={number_of_products})', end='\r')

    # Keep clicking "Mostrar más" until every product in the category is loaded
    while number_of_loaded_products < number_of_products:
        footer = driver.find_element(By.CSS_SELECTOR, 'p.text-content')
        driver.execute_script('arguments[0].scrollIntoView({block: "center"});', footer)
        show_more = driver.find_elements(By.XPATH, "//div[text()='Mostrar más']")
        if show_more:
            try:
                show_more[0].click()
            except (ElementClickInterceptedException, StaleElementReferenceException):
                continue
        number_of_loaded_products = int(footer.text.split()[1])
        print(info + f'(loaded products={number_of_loaded_products}, total={number_of_products})', end='\r')
        time.sleep(1)

    # All loaded products are now present in the page's ld+json payload
    loaded_products = json.loads(driver.find_element(
        By.CSS_SELECTOR, "body script[type='application/ld+json']"
    ).get_attribute('innerText'))['itemListElement']

    products = {'item': [], 'price': []}
    for prod in loaded_products:
        products['item'] += [prod['item']['name']]
        products['price'] += [prod['item']['offers']['offers'][0]['price']]
    data[category] = products
    print()

The code prints progress information while looping, and in the end you get something like this:

[1/10] Electro/aire-acondicionado-y-ventilacion (loaded products=7, total=7)
[2/10] Almacen/Aceites-y-Vinagres (loaded products=87, total=87)
[3/10] Almacen/Desayuno-y-Merienda (loaded products=808, total=808)
[4/10] Lacteos/Leches (loaded products=80, total=80)
[5/10] Frutas-y-Verduras/Frutas (loaded products=70, total=70)
[6/10] Bebes-y-Ninos/Jugueteria (loaded products=57, total=57)
[7/10] Quesos-y-Fiambres/Fiambres (loaded products=19, total=19)
[8/10] Panaderia-y-Reposteria/Panaderia (loaded products=17, total=17)
[9/10] Mascotas/Perros (loaded products=66, total=66)
[10/10] Bebidas/Gaseosas (loaded products=64, total=64)
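
Since the question asks about running this at a larger scale, one practice worth adding is a per-category retry, so that a single flaky page load does not abort a long run. A minimal sketch, assuming the per-category loop body above is factored into a hypothetical scrape_category(driver, category) that returns the products dict:

def scrape_with_retries(driver, category, retries=2):
    # scrape_category is a hypothetical refactoring of the loop body above
    for attempt in range(retries + 1):
        try:
            return scrape_category(driver, category)
        except Exception as exc:  # ideally catch the specific Selenium exceptions
            print(f'{category}: attempt {attempt + 1} failed ({exc})')
            time.sleep(5)
    return {'item': [], 'price': []}  # give up on this category, keep the run alive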

To visualize the scraped data you can run pd.DataFrame(data[categories[idx]]), where idx is an integer from 0 to len(categories)-1. For example, for idx=1 you get the items and prices of the Almacen/Aceites-y-Vinagres category.
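
Finally, if you want the single consolidated table the question was after, a minimal sketch (the category column name is my own choice, not part of the original answer):

import pandas as pd

# One row per product, tagged with the category it came from
dfAll = pd.concat(
    [pd.DataFrame(products).assign(category=cat) for cat, products in data.items()],
    ignore_index=True
)
print(dfAll.head())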
