Unable to scrape Category Ratings from Glassdoor

Question


I tried scraping reviews from Glassdoor using Python (Selenium). Extraction worked fine for the rating, pros, cons, date, job_title, and employee_type fields, but scraping the per-category ratings does not work.

I first created the extract_star_rating method, because categories with the same rating share the same class names, according to this rule:

if the category has a class name of css-1mfncox e1hd5jg10, it is rated 1 star; another class name ending in e1hd5jg10 means 2 stars, and so on (the full mapping is in the function below).

Here's the extract_star_rating function:

def extract_star_rating(review, category_name):
    xpath = f'//span[text()="{category_name}"]/ancestor::div[@class="common__EIReviewsRatingsStyles__RatingItemWrapper-sc-1dl5e6p-3 gdGrid"]//div[@class]'
    category_div = review.find_element(By.XPATH, xpath)
    class_name = category_div.get_attribute('class')
    if 'css-1mfncox' in class_name:
        return 1
    elif 'css-1lp3h8x' in class_name:
        return 2
    elif 'css-k58126' in class_name:
        return 3
    elif 'css-94nhxw' in class_name:
        return 4
    else:
        return 5

Then I call this function six times per review, once for each of the six category columns of the dataframe. However, I am not sure what to pass as its arguments when calling it.

# loop through all pages
for i in range(1, 3697):
    # visit the page
    page_url = f"{url[:-4]}_P{i}.htm"
    driver.get(page_url)
    # get all of the review elements on the page
    review_elements = driver.find_elements(by=By.XPATH, value="//div[@class='gdReview']")
    # loop through each review element and extract the relevant information
    for element in review_elements:
        review = {}
        review['Work/Life Balance'] = extract_star_rating(element, 'Work/Life Balance')
        review['Culture & Values'] = extract_star_rating(element, 'Culture & Values')
        review['Diversity & Inclusion'] = extract_star_rating(element, 'Diversity & Inclusion')
        review['Career Opportunities'] = extract_star_rating(element, 'Career Opportunities')
        review['Compensation and Benefits'] = extract_star_rating(element, 'Compensation and Benefits')
        review['Senior Management'] = extract_star_rating(element, 'Senior Management')
        reviews.append(review)

This is the error I get:

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//span[text()="Work/Life Balance"]/ancestor::div[@class="common__EIReviewsRatingsStyles__RatingItemWrapper-sc-1dl5e6p-3 gdGrid"]//div[@

Answer 1

Score: 0


Please make the following changes to your code.

#1 In the extract_star_rating method, change the XPath to:

xpath = f'.//div[text()="{category_name}"]/following-sibling::div'

#2 While iterating over the reviews, some rating categories may be missing for a given review, so that case has to be handled too: if a category is not present, set it to "N/A". For example:

for element in review_elements:
    review = {}
    try:
        review['Work/Life Balance'] = extract_star_rating(element, 'Work/Life Balance')
    except NoSuchElementException as e:
        # Category not present in this review
        review['Work/Life Balance'] = "N/A"

#3 Some reviews have no category ratings at all, only the overall rating. The method therefore first checks whether category-level ratings are available, and otherwise returns the overall rating given by the user.

Updated method:

def extract_star_rating(reviewElement, category_name):
    # Check whether category-level ratings are available for this review
    try:
        reviewElement.find_element(By.XPATH, ".//aside")
    except NoSuchElementException:
        # No category block, so fall back to the review's overall rating.
        # The leading "." keeps the search inside this review element.
        print("No Category level Rating Info")
        rating = int(float(reviewElement.find_element(By.XPATH, ".//span[contains(@class,'ratingNumber')]").text))
        return rating
    xpath = f'.//div[text()="{category_name}"]/following-sibling::div'

    # Category-level ratings are available: map the CSS class to a star count
    category_div = reviewElement.find_element(By.XPATH, xpath)
    class_name = category_div.get_attribute('class')
    if 'css-1mfncox' in class_name:
        return 1
    elif 'css-1lp3h8x' in class_name:
        return 2
    elif 'css-k58126' in class_name:
        return 3
    elif 'css-94nhxw' in class_name:
        return 4
    else:
        return 5
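
An XPath passed to reviewElement.find_element is only scoped to that review when it starts with a dot. In Selenium, an XPath beginning with // is evaluated from the document root even when it is called on an element, so every review would match the same first node. The sketch below illustrates that scoping effect with the standard library's xml.etree.ElementTree, used purely as a stand-in that needs no browser. The two-review markup is invented for the illustration, and ElementTree supports only a subset of XPath, so this shows the idea rather than Selenium's exact behaviour.

```python
import xml.etree.ElementTree as ET

# Invented two-review page, only for illustrating search scope.
PAGE = """
<html><body>
  <div class="gdReview"><span class="ratingNumber">4.0</span></div>
  <div class="gdReview"><span class="ratingNumber">2.0</span></div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)
reviews = root.findall(".//div[@class='gdReview']")

# Searching from each review element: one rating per review.
scoped = [r.find(".//span[@class='ratingNumber']").text for r in reviews]

# Searching from the document root: the first rating repeats for every review.
unscoped = [root.find(".//span[@class='ratingNumber']").text for _ in reviews]

print(scoped)    # ['4.0', '2.0']
print(unscoped)  # ['4.0', '4.0']
```

Searching from the review element yields one value per review, while searching from the root repeats the first review's value, which is the symptom to watch for when a rating column comes out identical on every row.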

Below is the full code, which I have tested on the page referenced in it. It is an example for a single page; you can add a for loop to scrape all pages for your URL.

from selenium.webdriver.common.by import By
import undetected_chromedriver
from selenium.common.exceptions import NoSuchElementException

base_url = 'https://www.glassdoor.co.in/Reviews/Cognizant-Technology-Solutions-Reviews-E8014_P3.htm?filter.iso3Language=eng'
page_count = 442

driver = undetected_chromedriver.Chrome()
driver.get(base_url)
# get all of the review elements on the page
review_elements = driver.find_elements(by=By.XPATH, value="//div[@class='gdReview']")

# loop through each review element and extract the relevant information
reviews = []


def extract_star_rating(reviewElement, category_name):
    # Check whether category-level ratings are available for this review
    try:
        reviewElement.find_element(By.XPATH, ".//aside")
    except NoSuchElementException:
        # No category block, so fall back to the review's overall rating.
        # The leading "." keeps the search inside this review element.
        print("No Category level Rating Info")
        rating = int(float(reviewElement.find_element(By.XPATH, ".//span[contains(@class,'ratingNumber')]").text))
        return rating
    xpath = f'.//div[text()="{category_name}"]/following-sibling::div'

    # Processing as Rating by Category is available
    category_div = reviewElement.find_element(By.XPATH, xpath)
    class_name = category_div.get_attribute('class')
    if 'css-1mfncox' in class_name:
        return 1
    elif 'css-1lp3h8x' in class_name:
        return 2
    elif 'css-k58126' in class_name:
        return 3
    elif 'css-94nhxw' in class_name:
        return 4
    else:
        return 5


for element in review_elements:
    review = {}
    try:
        review['Work/Life Balance'] = extract_star_rating(element, 'Work/Life Balance')
    except NoSuchElementException as e:
        review['Work/Life Balance'] = "N/A"

    try:
        review['Culture & Values'] = extract_star_rating(element, 'Culture & Values')
    except NoSuchElementException as e:
        review['Culture & Values'] = "N/A"

    try:
        review['Diversity & Inclusion'] = extract_star_rating(element, 'Diversity and Inclusion')
    except NoSuchElementException as e:
        review['Diversity & Inclusion'] = "N/A"

    try:
        review['Career Opportunities'] = extract_star_rating(element, 'Career Opportunities')
    except NoSuchElementException as e:
        review['Career Opportunities'] = "N/A"

    try:
        review['Compensation and Benefits'] = extract_star_rating(element, 'Compensation and Benefits')
    except NoSuchElementException as e:
        review['Compensation and Benefits'] = "N/A"

    try:
        review['Senior Management'] = extract_star_rating(element, 'Senior Management')
    except NoSuchElementException as e:
        review['Senior Management'] = "N/A"

    reviews.append(review)
for r in reviews:
    print(r)

This extracts all the ratings and prints them:

{'Work/Life Balance': 2, 'Culture & Values': 4, 'Diversity & Inclusion': 4, 'Career Opportunities': 4, 'Compensation and Benefits': 4, 'Senior Management': 4}
{'Work/Life Balance': 3, 'Culture & Values': 3, 'Diversity & Inclusion': 3, 'Career Opportunities': 3, 'Compensation and Benefits': 3, 'Senior Management': 3}
{'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 3, 'Compensation and Benefits': 5, 'Senior Management': 4}
{'Work/Life Balance': 2, 'Culture & Values': 2, 'Diversity & Inclusion': 5, 'Career Opportunities': 4, 'Compensation and Benefits': 1, 'Senior Management': 2}
{'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 5, 'Compensation and Benefits': 3, 'Senior Management': 5}
{'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 5, 'Compensation and Benefits': 4, 'Senior Management': 5}
{'Work/Life Balance': 2, 'Culture & Values': 2, 'Diversity & Inclusion': 2, 'Career Opportunities': 2, 'Compensation and Benefits': 2, 'Senior Management': 2}
{'Work/Life Balance': 3, 'Culture & Values': 3, 'Diversity & Inclusion': 3, 'Career Opportunities': 3, 'Compensation and Benefits': 3, 'Senior Management': 3}
{'Work/Life Balance': 4, 'Culture & Values': 4, 'Diversity & Inclusion': 3, 'Career Opportunities': 3, 'Compensation and Benefits': 4, 'Senior Management': 2}
{'Work/Life Balance': 2, 'Culture & Values': 3, 'Diversity & Inclusion': 4, 'Career Opportunities': 3, 'Compensation and Benefits': 4, 'Senior Management': 2}

If some categories are missing, the output looks like this:

{'Work/Life Balance': 'N/A', 'Culture & Values': 'N/A', 'Diversity & Inclusion': 'N/A', 'Career Opportunities': 1, 'Compensation and Benefits': 'N/A', 'Senior Management': 'N/A'}
{'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 5, 'Compensation and Benefits': 5, 'Senior Management': 5}
{'Work/Life Balance': 1, 'Culture & Values': 1, 'Diversity & Inclusion': 1, 'Career Opportunities': 1, 'Compensation and Benefits': 1, 'Senior Management': 1}
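
The six try/except blocks in the full code repeat one pattern, so they could be folded into a loop. The following is a minimal sketch, not the tested code from this answer. collect_ratings and CATEGORIES are hypothetical names. The extractor and the exception types to catch are passed in as arguments so the helper can be exercised without a browser. The page label for Diversity & Inclusion follows the full code above, which searches for 'Diversity and Inclusion'.

```python
# Keys are the output column names; values are the label text looked up on
# the page (taken from the calls in the full code above).
CATEGORIES = {
    "Work/Life Balance": "Work/Life Balance",
    "Culture & Values": "Culture & Values",
    "Diversity & Inclusion": "Diversity and Inclusion",
    "Career Opportunities": "Career Opportunities",
    "Compensation and Benefits": "Compensation and Benefits",
    "Senior Management": "Senior Management",
}


def collect_ratings(element, extract, errors=(LookupError,), missing="N/A"):
    """Call extract(element, label) for each category.

    Any exception listed in `errors` is treated as "category not present"
    and recorded as `missing` instead of aborting the whole review.
    """
    review = {}
    for column, label in CATEGORIES.items():
        try:
            review[column] = extract(element, label)
        except errors:
            review[column] = missing
    return review
```

With Selenium this would presumably be called as collect_ratings(element, extract_star_rating, errors=(NoSuchElementException,)) inside the loop over review_elements.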

Note: other pages may contain further rating-related cases that you will need to handle.
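
To cover every page rather than a single one, the page URLs can be generated from the base URL. This is a rough sketch that assumes the _P<n>.htm naming seen in the question's loop and in base_url above holds for all pages. page_urls is a hypothetical helper and not part of the tested code.

```python
import re


def page_urls(base_url, page_count):
    """Yield one URL per review page, following the _P<n>.htm pattern."""
    # Keep any query string (e.g. the language filter) separate from the path.
    path, sep, query = base_url.partition("?")
    # Drop ".htm" and any existing _P<n> suffix, so base_url may point at any page.
    stem = re.sub(r"(_P\d+)?\.htm$", "", path)
    for n in range(1, page_count + 1):
        yield f"{stem}_P{n}.htm{sep}{query}"
```

Each generated URL would then be passed to driver.get(...) before collecting review_elements, with page_count set to the number of pages the site actually reports.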

huangapple
  • Posted on May 14, 2023 at 00:14:16
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76243731.html