Web scraping, but an element does not have a class or an attribute

Question


I'm new to web scraping, and for my first project I have to scrape a website that sells used cars. The main page of this website shows the main details such as the name of the car, the price, and the km, but if I click on any of these cars it takes me to a more detailed page where I can find other information such as the color, etc. For the main page everything is good, but on the detailed page, when I inspect it, none of the elements I need have a class or an attribute. I'm using Python and BeautifulSoup. Does anyone have a solution?

import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm

# Define a list to store the information for each car
car_info = []

# Define the headers to use for the request
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}

# Create a session to store the headers
session = requests.Session()
session.headers.update(headers)

# Loop through the pages with the car listings
for page_num in tqdm(range(1, 25)): # To get all cars, set the range to 266
    # Make a GET request to the URL for the current page
    url = f'https://www.automobile.tn/fr/occasion/{page_num}'
    page = session.get(url)

    # Parse the HTML content of the page
    soup = BeautifulSoup(page.content, "html.parser")

    # Find all the car listings on the page
    car_listings = soup.find_all("div", class_="occasion-item")

    # Loop through each car listing
    for listing in car_listings:
        # Extract the title and price of the car
        title = listing.find("h2").text.strip()
        price = listing.find("div", class_="price").text.strip()

        # Extract the KM and year of the car
        year = listing.select_one('li[class="year"]').text.strip()
        km = listing.select_one('li[class="road"]').text.strip()

        # Extract the boite information
        boite = listing.find("li", class_="boite").text.strip()

        # Find the link to the detailed page for the car
        detailed_page_link = "https://www.automobile.tn" + listing.select_one("a")["href"]

        # Make a GET request to the URL for the detailed page
        detailed_page = session.get(detailed_page_link)

        # Parse the HTML content of the detailed page
        detailed_soup = BeautifulSoup(detailed_page.content, "html.parser")

        # Extract the color, fuel, carrosserie, and puissance_fiscale information
        color = detailed_soup.find("div", class_="color").text.strip()
        fuel = detailed_soup.find("div", class_="fuel").text.strip()
        carrosserie = detailed_soup.find("div", class_="carrosserie").text.strip()
        puissance_fiscale = detailed_soup.find("div", class_="puissance_fiscale").text.strip()

        transmission = detailed_soup.find("div", class_="transmission").text.strip()

        # Add the information for the car to the list
        car_info.append((title, price, km, year, boite, color, fuel, carrosserie, puissance_fiscale, transmission))

# Column names must match the order of the tuples appended above
df = pd.DataFrame(car_info, columns=['title', 'price', 'km', 'year', 'boite', 'color', 'fuel', 'carrosserie', 'puissance_fiscale', 'transmission'])
df.to_csv('various_cars.csv', index=False)
print(df)

This is my code; everything works fine until the color, fuel, etc. part. This is the error I'm getting:

AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_8568\2004631293.py in <module>
     51 
     52         # Extract the color, fuel, carrosserie, and puissance_fiscale information
---> 53         color = detailed_soup.select_one('div[itemprop="color"]').text.strip()
     54         fuel = detailed_soup.find("div", class_="fuel").text.strip()
     55         carrosserie = detailed_soup.find("div", class_="carrosserie").text.strip()

AttributeError: 'NoneType' object has no attribute 'text'

Answer 1

Score: 0

Can you include a link to a page, as well as a screenshot or snippet of the source HTML, that led you to use the div[itemprop="color"] / div.fuel / div.carrosserie / div.puissance_fiscale selectors? I only inspected /peugeot/208/84502 in my browser, but I couldn't find any of them there, which means at least SOME of the pages don't have any elements that match these selectors.

If you're not sure that your target element will be found, you should check for None before trying to extract text, as in the function below. (It's a simplification of selectGet from this set of functions.)

def selectTxt(tagSoup, selector='', defaultVal=None):
    el = tagSoup.select_one(str(selector).strip()) if selector else tagSoup
    return defaultVal if el is None else el.get_text(' ', strip=True)
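
As a quick sanity check, here is a tiny made-up example (the HTML snippet is invented for illustration, not taken from the site):

from bs4 import BeautifulSoup

demo = BeautifulSoup('<div itemprop="color">Noir</div>', 'html.parser')
print(selectTxt(demo, 'div[itemprop="color"]'))       # element found -> 'Noir'
print(selectTxt(demo, 'div.fuel', defaultVal='N/A'))  # no match -> 'N/A'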

So, using color = selectTxt(detailed_soup, 'div[itemprop="color"]') would avoid raising an error, but it would also just return None for the page that I inspected. The solution below uses the function, but also loops to get all items from the specs preview of the listing, the info section of the details page, and the technical details section (which has the color, btw) of the details page.

firstPg, lastPg = 1, 25
car_info = [] ## AS BEFORE
renameKeys = {
    'Kms': 'Kilométrage', 'Localité': 'Gouvernorat', 'Boîte vitesses': 'Boîte'
}

headers = { ## AS BEFORE
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
session = requests.Session() ## AS BEFORE
session.headers.update(headers) ## AS BEFORE


for page_num in tqdm(range(firstPg, lastPg+1)):
    url = f'https://www.automobile.tn/fr/occasion/{page_num}' ## AS BEFORE
    page = session.get(url) ## AS BEFORE
    soup = BeautifulSoup(page.content, "html.parser") ## AS BEFORE
    car_listings = soup.find_all("div", class_="occasion-item") ## AS BEFORE

    for listing in car_listings:
        liDets = { k: selectTxt(listing, s) for k, s in [
            ('title','h2'), ('price','div.price'), ('description','p')] }

        sPrvw_sel = 'span.name~span.value'
        for spec in listing.select(f'ul.specs-preview>li:has({sPrvw_sel})'):
            k = selectTxt(spec,'span.name:has(~span.value)').strip(':').strip()
            liDets[renameKeys.get(k,k)] = selectTxt(spec,sPrvw_sel)

        listingLink = listing.select_one('a[href]')
        if not listingLink: ## JUST IN CASE
            car_info.append(liDets)
            continue
        detailed_page_link = "https://www.automobile.tn" + listingLink["href"]
        detailed_page = session.get(detailed_page_link) ## AS BEFORE
        detailed_soup = BeautifulSoup(detailed_page.content, "html.parser") ## AS BEFORE

        for info in detailed_soup.select('div.infos>ul>li:has(label~span)'):
            k = selectTxt(info,'label:has(~span)').strip(':').strip()
            liDets[renameKeys.get(k,k)] = selectTxt(info,'label~span')

        tblSel = 'table[data-index="main"]:has(thead~tbody>tr:only-child)'
        for spec in detailed_soup.select(tblSel):
            k = selectTxt(spec,'thead').strip(':').strip()
            liDets[renameKeys.get(k,k)] = selectTxt(spec,'tbody>tr')

        # phoneSel = 'div.infos>ul>li>span:has(i.fa-phone-alt)'
        # liDets['phone'] = selectTxt(detailed_soup, phoneSel)
        # liDets['details_link'] = detailed_page_link
        car_info.append(liDets)

Use firstPg and lastPg to set which pages to scrape, and renameKeys to change any column names that are scraped from the page with k = selectTxt.... [The 3 I've already set are just to avoid getting duplicate columns with different names from overlaps between the specs preview and the details page.] With lastPg=2 I get: [screenshot of the resulting DataFrame]
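
The code above stops at car_info.append(liDets); to get back to the CSV step from the question, one minimal sketch (my addition, not part of the original answer) is to let pandas align the dicts itself:

import pandas as pd

# Each liDets dict becomes one row; pandas unions the keys into columns
# and fills specs missing from a given listing with NaN.
df = pd.DataFrame(car_info)
df.to_csv('various_cars.csv', index=False)
print(df.shape)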


Btw, if you uncomment the last 3 lines and add two more for loops to also include the bottom of the info section and the [collapsed] equipments section:

        for desc in detailed_soup.select('div.description>p:has(b+br)'):
            k = selectTxt(desc,'b:has(+br)').strip(':').strip()
            desc.select_one('b:has(+br)').replace_with('') # !THIS EDITS detailed_soup
            liDets[renameKeys.get(k,k)] = desc.get_text(' ', strip=True)

        for e,eq in enumerate(detailed_soup.select('table[data-index="add"]')):
            k = selectTxt(eq,'thead',f'Equipements {e}').strip(':').strip()
            liDets[renameKeys.get(k,k)] = [selectTxt(x) for x in eq.select('td')]

        phoneSel = 'div.infos>ul>li>span:has(i.fa-phone-alt)'
        liDets['phone'] = selectTxt(detailed_soup, phoneSel)
        liDets['details_link'] = detailed_page_link

then the DataFrame might have 7 more columns and you'll have collected pretty much all the data available on the details page [except for the comments].

If you want the comments as well, you can target them with the .dsq-widget-comment selector:

        cmntList = detailed_soup.select('.dsq-widget-comment>p:only-child')
        liDets['comments_list'] = [selectTxt(x) for x in cmntList]
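
One caveat worth adding (mine, not from the original answer): the :has() pseudo-class used throughout these selectors is resolved by Soup Sieve, the CSS selector engine bundled with BeautifulSoup 4, so if select() complains about an unsupported selector, upgrading is the first thing to try:

pip install --upgrade beautifulsoup4 soupsieve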
