web scraping but an element does not have a class or an attribute
Question
This is my code; everything runs fine until the color, fuel, etc. part. This is the error message I'm getting:
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_8568\2004631293.py in <module>
51
52 # Extract the color, fuel, carrosserie, and puissance_fiscale information
--> 53 color = detailed_soup.select_one('div[itemprop="color"]').text.strip()
54 fuel = detailed_soup.find("div", class_="fuel").text.strip()
55 carrosserie = detailed_soup.find("div", class_="carrosserie").text.strip()
AttributeError: 'NoneType' object has no attribute 'text'
This error means that no matching element was found on the detail page, so its text cannot be extracted. You can try the following:
- First, make sure the site's HTML structure has not changed, since a change to the page structure can break the scraper. Check whether the site has been updated.
- Use a try/except block to handle the possible AttributeError so the program does not crash, and check that the element exists before trying to extract its text. Example:
color_element = detailed_soup.select_one('div[itemprop="color"]')
color = color_element.text.strip() if color_element is not None else "N/A"
This checks whether color_element is None and falls back to "N/A" if it is.
- If you still cannot extract the information, try another approach, such as regular expressions or XPath, to locate and extract the data (see the sketch below).
Remember to scrape the site responsibly and follow its terms of use and applicable laws.
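A minimal sketch of the try/except and XPath suggestions above, assuming the detailed_page / detailed_soup objects from the question's code; the selector is the one from the traceback and may not exist on the live page, and lxml is an extra dependency the original post does not use:

# Illustration only: guard the original extraction with try/except
try:
    color = detailed_soup.select_one('div[itemprop="color"]').text.strip()
except AttributeError:
    color = "N/A"

# Illustration only: XPath via lxml (BeautifulSoup itself has no XPath support);
# xpath() returns a list, so a missing element yields [] instead of raising
from lxml import html
tree = html.fromstring(detailed_page.content)
matches = tree.xpath('//div[@itemprop="color"]/text()')
color = matches[0].strip() if matches else "N/A"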
English:
I'm new to web scraping, and for my first project I have to scrape a website that sells used cars. The main page of this website shows the main details, such as the name of the car, price, and km, but if I click on any of these cars it takes me to a more detailed page where I can find other information such as the color, etc. For the main page everything is good, but on the detailed page, when I inspect the page, all the elements I need do not have a class or an attribute. I'm using Python with BeautifulSoup. Does anyone have a solution?
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm

# Define a list to store the information for each car
car_info = []

# Define the headers to use for the request
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}

# Create a session to store the headers
session = requests.Session()
session.headers.update(headers)

# Loop through the pages with the car listings
for page_num in tqdm(range(1, 25)):  # To get all cars, set the range to 266
    # Make a GET request to the URL for the current page
    url = f'https://www.automobile.tn/fr/occasion/{page_num}'
    page = session.get(url)

    # Parse the HTML content of the page
    soup = BeautifulSoup(page.content, "html.parser")

    # Find all the car listings on the page
    car_listings = soup.find_all("div", class_="occasion-item")

    # Loop through each car listing
    for listing in car_listings:
        # Extract the title and price of the car
        title = listing.find("h2").text.strip()
        price = listing.find("div", class_="price").text.strip()

        # Extract the KM and year of the car
        year = listing.select_one('li[class="year"]').text.strip()
        km = listing.select_one('li[class="road"]').text.strip()

        # Extract the boite information
        boite = listing.find("li", class_="boite").text.strip()

        # Find the link to the detailed page for the car
        detailed_page_link = "https://www.automobile.tn" + listing.select_one("a")["href"]

        # Make a GET request to the URL for the detailed page
        detailed_page = session.get(detailed_page_link)

        # Parse the HTML content of the detailed page
        detailed_soup = BeautifulSoup(detailed_page.content, "html.parser")

        # Extract the color, fuel, carrosserie, and puissance_fiscale information
        color = detailed_soup.find("div", class_="color").text.strip()
        fuel = detailed_soup.find("div", class_="fuel").text.strip()
        carrosserie = detailed_soup.find("div", class_="carrosserie").text.strip()
        puissance_fiscale = detailed_soup.find("div", class_="puissance_fiscale").text.strip()
        transmission = detailed_soup.find("div", class_="transmission").text.strip()

        # Add the information for the car to the list
        car_info.append((title, price, km, year, boite, color, fuel, carrosserie, puissance_fiscale, transmission))

df = pd.DataFrame(car_info, columns=['title', 'price', 'km', 'year', 'boite', 'color', 'fuel', 'carrosserie', 'puissance_fiscale', 'transmission'])
df.to_csv('various_cars.csv', index=False)
print(df)
This is my code; everything works fine until the color, fuel, etc. part. This is the error I'm getting:
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_8568\2004631293.py in <module>
51
52 # Extract the color, fuel, carrosserie, and puissance_fiscale information
---> 53 color = detailed_soup.select_one('div[itemprop="color"]').text.strip()
54 fuel = detailed_soup.find("div", class_="fuel").text.strip()
55 carrosserie = detailed_soup.find("div", class_="carrosserie").text.strip()
AttributeError: 'NoneType' object has no attribute 'text'
Answer 1
Score: 0
Can you include a link to a page, as well as a screenshot or snippet of the source HTML, that led you to use the div[itemprop="color"] / div.fuel / div.carrosserie / div.puissance_fiscale selectors? I only inspected /peugeot/208/84502 in my browser, but I couldn't find any of them there, which means at least SOME of the pages don't have any elements that match these selectors.
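(As an illustrative check, not part of the original answer: with detailed_soup being a parsed details page from the question's code, you can count how many elements each of those selectors actually matches.)

# Sketch: zero matches for a selector is exactly what produces the 'NoneType' error
for sel in ['div[itemprop="color"]', 'div.fuel', 'div.carrosserie', 'div.puissance_fiscale']:
    print(sel, '->', len(detailed_soup.select(sel)))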
If you're not sure that your target element will be found, you should check for None before trying to extract text, like in the function below. (It's a simplification of selectGet from this set of functions.)
def selectTxt(tagSoup, selector='', defaultVal=None):
    el = tagSoup.select_one(str(selector).strip()) if selector else tagSoup
    return defaultVal if el is None else el.get_text(' ', strip=True)
So, using color = selectTxt(detailed_soup, 'div[itemprop="color"]') would avoid raising an error, but it would also just return None for the page that I inspected. The solution below uses the function, but it also loops to get all items from the specs preview of the listing, the info section of the details page, and the technical details section (which has the color, btw) of the details page.
firstPg, lastPg = 1, 25
car_info = [] ## AS BEFORE

renameKeys = {
    'Kms': 'Kilométrage', 'Localité': 'Gouvernorat', 'Boîte vitesses': 'Boîte'
}
headers = { ## AS BEFORE
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
session = requests.Session() ## AS BEFORE
session.headers.update(headers) ## AS BEFORE

for page_num in tqdm(range(firstPg, lastPg+1)):
    url = f'https://www.automobile.tn/fr/occasion/{page_num}' ## AS BEFORE
    page = session.get(url) ## AS BEFORE
    soup = BeautifulSoup(page.content, "html.parser") ## AS BEFORE
    car_listings = soup.find_all("div", class_="occasion-item") ## AS BEFORE

    for listing in car_listings:
        liDets = { k: selectTxt(listing, s) for k, s in [
            ('title','h2'), ('price','div.price'), ('description','p')] }

        sPrvw_sel = 'span.name~span.value'
        for spec in listing.select(f'ul.specs-preview>li:has({sPrvw_sel})'):
            k = selectTxt(spec,'span.name:has(~span.value)').strip(':').strip()
            liDets[renameKeys.get(k,k)] = selectTxt(spec,sPrvw_sel)

        listingLink = listing.select_one('a[href]')
        if not listingLink: ## JUST IN CASE
            car_info.append(liDets)
            continue
        detailed_page_link = "https://www.automobile.tn" + listingLink["href"]
        detailed_page = session.get(detailed_page_link) ## AS BEFORE
        detailed_soup = BeautifulSoup(detailed_page.content, "html.parser") ## AS BEFORE

        for info in detailed_soup.select('div.infos>ul>li:has(label~span)'):
            k = selectTxt(info,'label:has(~span)').strip(':').strip()
            liDets[renameKeys.get(k,k)] = selectTxt(info,'label~span')

        tblSel = 'table[data-index="main"]:has(thead~tbody>tr:only-child)'
        for spec in detailed_soup.select(tblSel):
            k = selectTxt(spec,'thead').strip(':').strip()
            liDets[renameKeys.get(k,k)] = selectTxt(spec,'tbody>tr')

        # phoneSel = 'div.infos>ul>li>span:has(i.fa-phone-alt)'
        # liDets['phone'] = selectTxt(detailed_soup, phoneSel)
        # liDets['details_link'] = detailed_page_link

        car_info.append(liDets)
Use firstPg and lastPg to set which pages to scrape, and renameKeys to change any column names that are scraped from the page with k = selectTxt.... [The 3 I've already set are just to avoid duplicate columns with different names from overlaps between the specs preview and the details page.] With lastPg=2 I get: [screenshot of the resulting DataFrame, not reproduced here]
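As a minimal sketch of how to materialize and save the collected rows (this is not from the original answer; it mirrors the pandas step in the question and uses the car_info list built above):

# One dict per listing becomes one row per listing; pandas aligns columns by key
# and fills missing values with NaN, so listings with fewer specs still fit.
df = pd.DataFrame(car_info)
df.to_csv('various_cars.csv', index=False)
print(df.shape)
print(df.columns.tolist())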
Btw, if you uncomment the last 3 lines and add two more for loops to also include the bottom of the info section and the [collapsed] equipments section:
for desc in detailed_soup.select('div.description>p:has(b+br)'):
    k = selectTxt(desc,'b:has(+br)').strip(':').strip()
    desc.select_one('b:has(+br)').replace_with('') # !THIS EDITS detailed_soup
    liDets[renameKeys.get(k,k)] = desc.get_text(' ', strip=True)
for e,eq in enumerate(detailed_soup.select('table[data-index="add"]')):
    k = selectTxt(eq,'thead',f'Equipements {e}').strip(':').strip()
    liDets[renameKeys.get(k,k)] = [selectTxt(x) for x in eq.select('td')]

phoneSel = 'div.infos>ul>li>span:has(i.fa-phone-alt)'
liDets['phone'] = selectTxt(detailed_soup, phoneSel)
liDets['details_link'] = detailed_page_link
then the DataFrame might have 7 more columns and you'll have collected pretty much all the data available on the details page [except for the comments].
If you want the comments as well, you can target them with the .dsq-widget-comment selector:
cmntList = detailed_soup.select('.dsq-widget-comment>p:only-child')
liDets['comments_list'] = [selectTxt(x) for x in cmntList]