Unable to scrape Category Ratings from Glassdoor
Question
I tried scraping reviews from Glassdoor using Python. Everything worked fine for the rating, pros, cons, date, job_title, and employee_type data, but scraping the per-category ratings does not work.
I first wrote the extract_star_rating function because categories with the same rating share the same class names, following this rule:
if the category element has the class name css-1mfncox e1hd5jg10, it is rated 1 star; a different first class (css-1lp3h8x in the code below) means 2 stars, and so on.
Here's the extract_star_rating function:
def extract_star_rating(review, category_name):
    xpath = f'//span[text()="{category_name}"]/ancestor::div[@class="common__EIReviewsRatingsStyles__RatingItemWrapper-sc-1dl5e6p-3 gdGrid"]//div[@class]'
    category_div = review.find_element(By.XPATH, xpath)
    class_name = category_div.get_attribute('class')
    if 'css-1mfncox' in class_name:
        return 1
    elif 'css-1lp3h8x' in class_name:
        return 2
    elif 'css-k58126' in class_name:
        return 3
    elif 'css-94nhxw' in class_name:
        return 4
    else:
        return 5
Then I call this function six times, once for each of the six rating columns of the dataframe, but I'm not sure what arguments to pass when calling it.
# loop through all pages
for i in range(1, 3697):
    # visit the page
    page_url = f"{url[:-4]}_P{i}.htm"
    driver.get(page_url)
    # get all of the review elements on the page
    review_elements = driver.find_elements(by=By.XPATH, value="//div[@class='gdReview']")
    # loop through each review element and extract the relevant information
    for element in review_elements:
        review = {}
        review['Work/Life Balance'] = extract_star_rating(element, 'Work/Life Balance')
        review['Culture & Values'] = extract_star_rating(element, 'Culture & Values')
        review['Diversity & Inclusion'] = extract_star_rating(element, 'Diversity & Inclusion')
        review['Career Opportunities'] = extract_star_rating(element, 'Career Opportunities')
        review['Compensation and Benefits'] = extract_star_rating(element, 'Compensation and Benefits')
        review['Senior Management'] = extract_star_rating(element, 'Senior Management')
        reviews.append(review)
This is the error I get:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//span[text()="Work/Life Balance"]/ancestor::div[@class="common__EIReviewsRatingsStyles__RatingItemWrapper-sc-1dl5e6p-3 gdGrid"]//div[@
Answer 1
Score: 0
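Please make the following changes to your code.
#1 In the extract_star_rating function, change the XPath to the one below. The leading . makes the search relative to the current review element instead of the whole page.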
xpath = f'.//div[text()="{category_name}"]/following-sibling::div'
#2 When you iterate over all the reviews, some rating categories may be missing from a review, so you have to handle that case as well. If a category is not present, set it to "N/A", for example:
for element in review_elements:
    try:
        review['Work/Life Balance'] = extract_star_rating(element, 'Work/Life Balance')
    except NoSuchElementException as e:
        review['Work/Life Balance'] = "N/A"
#3 There is also the case where a review has no category ratings at all and only the total rating is available. In that case, check whether category-level ratings exist and, if not, return the total rating given by the user.
Updated function for this:
def extract_star_rating(reviewElement, category_name):
    # Check whether ratings by category are available for this review
    try:
        reviewElement.find_element(By.XPATH, ".//aside")
    except NoSuchElementException:
        # No <aside> found, so there are no category ratings: return the total rating.
        # The leading "." keeps the search inside this review; a bare "//" would
        # search the whole page and return the first review's rating.
        print("No Category level Rating Info")
        rating = int(float(reviewElement.find_element(By.XPATH, ".//span[contains(@class,'ratingNumber')]").text))
        return rating
    xpath = f'.//div[text()="{category_name}"]/following-sibling::div'
    # Category ratings are available, so read the star class
    category_div = reviewElement.find_element(By.XPATH, xpath)
    class_name = category_div.get_attribute('class')
    if 'css-1mfncox' in class_name:
        return 1
    elif 'css-1lp3h8x' in class_name:
        return 2
    elif 'css-k58126' in class_name:
        return 3
    elif 'css-94nhxw' in class_name:
        return 4
    else:
        return 5
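One thing to be aware of in the function above: the final else returns 5 for any class it does not recognize, and the `in` checks match substrings of the whole class attribute. Below is a minimal sketch of a table-driven alternative. `STAR_CLASSES` and `stars_from_class` are hypothetical names, the four class names are copied from the code above, and the default of 5 mirrors the original else branch (the actual 5-star class name is not given in this thread).

```python
# Class names for 1 to 4 stars, copied from the answer's code.
STAR_CLASSES = {
    "css-1mfncox": 1,
    "css-1lp3h8x": 2,
    "css-k58126": 3,
    "css-94nhxw": 4,
}


def stars_from_class(class_attr, default=5):
    """Map a rating element's class attribute to a star count.

    The attribute is split into whole class tokens so only exact class
    names match. Unknown classes fall back to `default`, like the
    original else branch.
    """
    for token in (class_attr or "").split():
        if token in STAR_CLASSES:
            return STAR_CLASSES[token]
    return default


print(stars_from_class("css-1mfncox e1hd5jg10"))  # 1
print(stars_from_class("css-94nhxw e1hd5jg10"))   # 4
```

Inside extract_star_rating, the if/elif chain would then shrink to `return stars_from_class(category_div.get_attribute('class'))`. Passing `default=None` instead would make unrecognized classes visible rather than counting them as 5 stars.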
Here is the full code, which I tested on the page below. It is a sample for a single page; you can add a for loop to scrape all pages for your URL.
from selenium.webdriver.common.by import By
import undetected_chromedriver
from selenium.common.exceptions import NoSuchElementException

base_url = 'https://www.glassdoor.co.in/Reviews/Cognizant-Technology-Solutions-Reviews-E8014_P3.htm?filter.iso3Language=eng'
page_count = 442
driver = undetected_chromedriver.Chrome()
driver.get(base_url)
# get all of the review elements on the page
review_elements = driver.find_elements(by=By.XPATH, value="//div[@class='gdReview']")
# loop through each review element and extract the relevant information
reviews = []


def extract_star_rating(reviewElement, category_name):
    # Check whether ratings by category are available for this review
    try:
        reviewElement.find_element(By.XPATH, ".//aside")
    except NoSuchElementException:
        # No <aside> found, so there are no category ratings: return the total rating.
        # The leading "." keeps the search inside this review element.
        print("No Category level Rating Info")
        rating = int(float(reviewElement.find_element(By.XPATH, ".//span[contains(@class,'ratingNumber')]").text))
        return rating
    xpath = f'.//div[text()="{category_name}"]/following-sibling::div'
    # Category ratings are available, so read the star class
    category_div = reviewElement.find_element(By.XPATH, xpath)
    class_name = category_div.get_attribute('class')
    if 'css-1mfncox' in class_name:
        return 1
    elif 'css-1lp3h8x' in class_name:
        return 2
    elif 'css-k58126' in class_name:
        return 3
    elif 'css-94nhxw' in class_name:
        return 4
    else:
        return 5


for element in review_elements:
    review = {}
    try:
        review['Work/Life Balance'] = extract_star_rating(element, 'Work/Life Balance')
    except NoSuchElementException:
        review['Work/Life Balance'] = "N/A"
    try:
        review['Culture & Values'] = extract_star_rating(element, 'Culture & Values')
    except NoSuchElementException:
        review['Culture & Values'] = "N/A"
    try:
        # the page labels this category "Diversity and Inclusion"
        review['Diversity & Inclusion'] = extract_star_rating(element, 'Diversity and Inclusion')
    except NoSuchElementException:
        review['Diversity & Inclusion'] = "N/A"
    try:
        review['Career Opportunities'] = extract_star_rating(element, 'Career Opportunities')
    except NoSuchElementException:
        review['Career Opportunities'] = "N/A"
    try:
        review['Compensation and Benefits'] = extract_star_rating(element, 'Compensation and Benefits')
    except NoSuchElementException:
        review['Compensation and Benefits'] = "N/A"
    try:
        review['Senior Management'] = extract_star_rating(element, 'Senior Management')
    except NoSuchElementException:
        review['Senior Management'] = "N/A"
    reviews.append(review)

for r in reviews:
    print(r)

driver.quit()
This extracts all the ratings and prints them:
{'Work/Life Balance': 2, 'Culture & Values': 4, 'Diversity & Inclusion': 4, 'Career Opportunities': 4, 'Compensation and Benefits': 4, 'Senior Management': 4}
{'Work/Life Balance': 3, 'Culture & Values': 3, 'Diversity & Inclusion': 3, 'Career Opportunities': 3, 'Compensation and Benefits': 3, 'Senior Management': 3}
{'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 3, 'Compensation and Benefits': 5, 'Senior Management': 4}
{'Work/Life Balance': 2, 'Culture & Values': 2, 'Diversity & Inclusion': 5, 'Career Opportunities': 4, 'Compensation and Benefits': 1, 'Senior Management': 2}
{'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 5, 'Compensation and Benefits': 3, 'Senior Management': 5}
{'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 5, 'Compensation and Benefits': 4, 'Senior Management': 5}
{'Work/Life Balance': 2, 'Culture & Values': 2, 'Diversity & Inclusion': 2, 'Career Opportunities': 2, 'Compensation and Benefits': 2, 'Senior Management': 2}
{'Work/Life Balance': 3, 'Culture & Values': 3, 'Diversity & Inclusion': 3, 'Career Opportunities': 3, 'Compensation and Benefits': 3, 'Senior Management': 3}
{'Work/Life Balance': 4, 'Culture & Values': 4, 'Diversity & Inclusion': 3, 'Career Opportunities': 3, 'Compensation and Benefits': 4, 'Senior Management': 2}
{'Work/Life Balance': 2, 'Culture & Values': 3, 'Diversity & Inclusion': 4, 'Career Opportunities': 3, 'Compensation and Benefits': 4, 'Senior Management': 2}
If some categories are missing, we get:
{'Work/Life Balance': 'N/A', 'Culture & Values': 'N/A', 'Diversity & Inclusion': 'N/A', 'Career Opportunities': 1, 'Compensation and Benefits': 'N/A', 'Senior Management': 'N/A'}
{'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 5, 'Compensation and Benefits': 5, 'Senior Management': 5}
{'Work/Life Balance': 1, 'Culture & Values': 1, 'Diversity & Inclusion': 1, 'Career Opportunities': 1, 'Compensation and Benefits': 1, 'Senior Management': 1}
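The six near-identical try/except blocks in the full code can be folded into one loop. The sketch below uses a hypothetical helper, `collect_ratings`, which takes the extractor and the exception type as parameters so that it runs without a browser; with Selenium you would pass extract_star_rating and NoSuchElementException. The `CATEGORIES` mapping follows the full code above, where the 'Diversity & Inclusion' column is looked up under the page label 'Diversity and Inclusion'.

```python
# Column name -> label text searched for on the page (from the full code above).
CATEGORIES = {
    "Work/Life Balance": "Work/Life Balance",
    "Culture & Values": "Culture & Values",
    "Diversity & Inclusion": "Diversity and Inclusion",
    "Career Opportunities": "Career Opportunities",
    "Compensation and Benefits": "Compensation and Benefits",
    "Senior Management": "Senior Management",
}


def collect_ratings(element, extract, not_found, missing="N/A"):
    """Return {column: rating}, using `missing` when a category is absent."""
    review = {}
    for column, label in CATEGORIES.items():
        try:
            review[column] = extract(element, label)
        except not_found:
            review[column] = missing
    return review


# Stand-in for a review element so the sketch runs without Selenium:
# a plain dict holding only the category labels that are present.
def fake_extract(element, label):
    return element[label]  # raises KeyError when the label is absent


print(collect_ratings({"Career Opportunities": 1}, fake_extract, KeyError))
```

In the scraper, the body of the review loop would then become `reviews.append(collect_ratings(element, extract_star_rating, NoSuchElementException))`.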
Note: other pages may contain further rating cases that you will still need to handle.
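To scrape more than one page, the URLs in this thread suggest that Glassdoor marks the page number with a `_P<n>` suffix before `.htm` (the question builds `..._P{i}.htm`, and the tested URL contains `_P3.htm`). Here is a minimal sketch under that assumption; `page_url` is a hypothetical helper, and the pattern may not hold for every Glassdoor URL.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def page_url(base_url, page):
    """Return base_url pointing at the given page number.

    Assumes the path ends in '.htm', optionally preceded by '_P<n>',
    as in the URLs shown in this thread. The query string is kept.
    """
    parts = urlsplit(base_url)
    # Drop any existing '_P<n>' suffix together with the '.htm' extension.
    stem = re.sub(r"(_P\d+)?\.htm$", "", parts.path)
    return urlunsplit(parts._replace(path=f"{stem}_P{page}.htm"))


base = ("https://www.glassdoor.co.in/Reviews/"
        "Cognizant-Technology-Solutions-Reviews-E8014_P3.htm"
        "?filter.iso3Language=eng")
for i in range(1, 4):
    print(page_url(base, i))
```

In the full code, the single `driver.get(base_url)` call would then move inside a loop such as `for i in range(1, page_count + 1): driver.get(page_url(base_url, i))`, with the review extraction repeated for each page.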