May 14, 2023, 00:14:16

Unable to scrape Category Ratings from Glassdoor

Question

I've scraped reviews from Glassdoor using Python. Most of the extraction (rating, pros, cons, date, job title, and employee type) worked well. However, scraping the per-category ratings failed.

I created the extract_star_rating method. The star rating is encoded in the element's class name: if a category has the class name css-1mfncox e1hd5jg10, it's rated 1 star; if css-1lp3h8x e1hd5jg10, it's 2 stars, and so on. Here's the function:

def extract_star_rating(review, category_name):
    xpath = f'//span[text()="{category_name}"]/ancestor::div[@class="common__EIReviewsRatingsStyles__RatingItemWrapper-sc-1dl5e6p-3 gdGrid"]//div[@class]'
    category_div = review.find_element(By.XPATH, xpath)
    class_name = category_div.get_attribute('class')
    if 'css-1mfncox' in class_name:
        return 1
    elif 'css-1lp3h8x' in class_name:
        return 2
    elif 'css-k58126' in class_name:
        return 3
    elif 'css-94nhxw' in class_name:
        return 4
    else:
        return 5

I encountered an error, NoSuchElementException, likely due to incorrect XPATH selectors.

English original:

I tried scraping reviews from Glassdoor using Python. Everything worked fine for the rating, pros, cons, date, job_title, and employee_type data. But when I tried to scrape the rating of the categories, it doesn't seem to work perfectly.

I first created the extract_star_rating method, because categories that have the same rating share the same class names, according to this rule:

if the category has a class name of css-1mfncox e1hd5jg10 then it's rated 1 star, else if css-1lp3h8x e1hd5jg10 then 2 stars, and so on.

Here's the extract_star_rating function:

def extract_star_rating(review, category_name):
    xpath = f'//span[text()="{category_name}"]/ancestor::div[@class="common__EIReviewsRatingsStyles__RatingItemWrapper-sc-1dl5e6p-3 gdGrid"]//div[@class]'
    category_div = review.find_element(By.XPATH, xpath)
    class_name = category_div.get_attribute('class')
    if 'css-1mfncox' in class_name:
        return 1
    elif 'css-1lp3h8x' in class_name:
        return 2
    elif 'css-k58126' in class_name:
        return 3
    elif 'css-94nhxw' in class_name:
        return 4
    else:
        return 5

Then, I called this function 6 times since it will be applied to the 6 columns of the dataframe. But I don't really know what to put in the parameters of this function when it's called.

# loop through all pages
for i in range(1, 3697):
    # visit the page
    page_url = f"{url[:-4]}_P{i}.htm"
    driver.get(page_url)
    # get all of the review elements on the page
    review_elements = driver.find_elements(by=By.XPATH, value="//div[@class='gdReview']")
    # loop through each review element and extract the relevant information
    for element in review_elements:
        review = {}
        review['Work/Life Balance'] = extract_star_rating(element, 'Work/Life Balance')
        review['Culture & Values'] = extract_star_rating(element, 'Culture & Values')
        review['Diversity & Inclusion'] = extract_star_rating(element, 'Diversity & Inclusion')
        review['Career Opportunities'] = extract_star_rating(element, 'Career Opportunities')
        review['Compensation and Benefits'] = extract_star_rating(element, 'Compensation and Benefits')
        review['Senior Management'] = extract_star_rating(element, 'Senior Management')
        reviews.append(review)

This is the error I get:

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//span[text()="Work/Life Balance"]/ancestor::div[@class="common__EIReviewsRatingsStyles__RatingItemWrapper-sc-1dl5e6p-3 gdGrid"]//div[@

Answer 1

Score: 0

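#1 In the extract_star_rating method, change the xpath to the following:

xpath = f'.//div[text()="{category_name}"]/following-sibling::div'

#2 When you iterate over all reviews, some rating categories may not be available, so you also need to handle that case, for example: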

for element in review_elements:
    try:
        review['Work/Life Balance'] = extract_star_rating(element, 'Work/Life Balance')
    except NoSuchElementException as e:
        review['Work/Life Balance'] = "N/A"

#3 There is also the case where no category ratings are available and only the total rating is, so you can update the method as follows:

def extract_star_rating(reviewElement, category_name):
    try:
        reviewElement.find_element(By.XPATH, ".//aside")
    except NoSuchElementException:
        print("No Category level Rating Info")
        # The leading '.' keeps the search inside this review element
        rating = int(float(reviewElement.find_element(By.XPATH, ".//span[contains(@class,'ratingNumber')]").text))
        return rating
    xpath = f'.//div[text()="{category_name}"]/following-sibling::div'

    category_div = reviewElement.find_element(By.XPATH, xpath)
    class_name = category_div.get_attribute('class')
    if 'css-1mfncox' in class_name:
        return 1
    elif 'css-1lp3h8x' in class_name:
        return 2
    elif 'css-k58126' in class_name:
        return 3
    elif 'css-94nhxw' in class_name:
        return 4
    else:
        return 5

English original:

Please make the following changes in your code.

#1 In the extract_star_rating method, change the xpath to the following:

xpath = f'.//div[text()="{category_name}"]/following-sibling::div'
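
The leading `.` is what keeps the search inside the current review: in Selenium, an XPath that starts with `//` is evaluated from the document root even when `find_element` is called on an element. Below is a minimal sketch of a helper that builds the relative expression. The names `xpath_literal` and `category_xpath` are hypothetical (they are not part of Selenium), and the quoting logic only matters if a category label ever contains quote characters.

```python
def xpath_literal(text: str) -> str:
    """Quote text as an XPath 1.0 string literal (hypothetical helper)."""
    if '"' not in text:
        return f'"{text}"'
    if "'" not in text:
        return f"'{text}'"
    # Both quote kinds are present: split on double quotes and rejoin with concat().
    pieces = []
    for index, part in enumerate(text.split('"')):
        if index > 0:
            pieces.append("'\"'")  # a double quote, wrapped in single quotes
        if part:
            pieces.append(f'"{part}"')
    return "concat(" + ", ".join(pieces) + ")"


def category_xpath(category_name: str) -> str:
    """Build the relative XPath used in step #1 for one category label."""
    return f".//div[text()={xpath_literal(category_name)}]/following-sibling::div"


print(category_xpath("Work/Life Balance"))
```

The result can be passed straight to `reviewElement.find_element(By.XPATH, ...)`.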

#2 When you iterate over all reviews, some rating categories may not be available for a given review, so you have to handle that as well. If a category is not present, set it to "N/A", for example:

for element in review_elements:
    try:
        review['Work/Life Balance'] = extract_star_rating(element, 'Work/Life Balance')
    except NoSuchElementException as e:
        review['Work/Life Balance'] = "N/A"
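
Since the same try/except is needed for each of the six categories, it can be folded into a loop. This is only a sketch: `collect_category_ratings` and `CATEGORY_LABELS` are hypothetical names, and the label texts are the ones passed to extract_star_rating in this answer (note that the page label used for diversity is "Diversity and Inclusion"). The extractor and the exception type are passed in, so the helper itself does not depend on Selenium.

```python
# Result key -> label text to look up on the page (taken from the calls in this answer).
CATEGORY_LABELS = {
    "Work/Life Balance": "Work/Life Balance",
    "Culture & Values": "Culture & Values",
    "Diversity & Inclusion": "Diversity and Inclusion",
    "Career Opportunities": "Career Opportunities",
    "Compensation and Benefits": "Compensation and Benefits",
    "Senior Management": "Senior Management",
}


def collect_category_ratings(element, extract, missing_exc,
                             labels=CATEGORY_LABELS, default="N/A"):
    """Call extract(element, label) per category, storing default when it raises missing_exc."""
    review = {}
    for key, label in labels.items():
        try:
            review[key] = extract(element, label)
        except missing_exc:
            review[key] = default
    return review
```

With Selenium this would be called as `collect_category_ratings(element, extract_star_rating, NoSuchElementException)`.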

#3 There is also the case where no categories are available at all and only the total rating is. In that case, we check whether category-level ratings are available and otherwise return the total rating given by the user.
The updated method:

def extract_star_rating(reviewElement, category_name):
    # Checking if Rating by Category is available
    try:
        reviewElement.find_element(By.XPATH, ".//aside")
    except NoSuchElementException:
        # An exception here means Rating by Category is not available, so return the total Rating
        print("No Category level Rating Info")
        # The leading '.' keeps the search inside this review element
        rating = int(float(reviewElement.find_element(By.XPATH, ".//span[contains(@class,'ratingNumber')]").text))
        return rating
    xpath = f'.//div[text()="{category_name}"]/following-sibling::div'

    # Processing as Rating by Category is available
    category_div = reviewElement.find_element(By.XPATH, xpath)
    class_name = category_div.get_attribute('class')
    if 'css-1mfncox' in class_name:
        return 1
    elif 'css-1lp3h8x' in class_name:
        return 2
    elif 'css-k58126' in class_name:
        return 3
    elif 'css-94nhxw' in class_name:
        return 4
    else:
        return 5
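
The if/elif chain at the end of the method can also be written as a lookup table, which makes the class-to-stars mapping easier to update if the site's class names change. A minimal sketch, assuming the four class names from the code above; `STAR_CLASSES` and `stars_from_class` are hypothetical names, and unlike the original substring check this version matches whole class tokens.

```python
# Class name -> star rating, copied from the if/elif chain above.
STAR_CLASSES = {
    "css-1mfncox": 1,
    "css-1lp3h8x": 2,
    "css-k58126": 3,
    "css-94nhxw": 4,
}


def stars_from_class(class_name, default=5):
    """Map an element's class attribute to a star rating.

    Any class not in the table yields default, which mirrors the
    final else branch (5 stars) of the original method.
    """
    tokens = class_name.split()
    for css_class, stars in STAR_CLASSES.items():
        if css_class in tokens:
            return stars
    return default


print(stars_from_class("css-k58126 e1hd5jg10"))
```

Passing `default=None` makes an unrecognised class show up as a missing value instead of silently counting as 5 stars.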

Below is the full code, which I have tested on a single page. You can edit it to add a for loop that scrapes all pages for your URL; this is a sample for one page.

from selenium.webdriver.common.by import By
import undetected_chromedriver
from selenium.common.exceptions import NoSuchElementException

base_url = 'https://www.glassdoor.co.in/Reviews/Cognizant-Technology-Solutions-Reviews-E8014_P3.htm?filter.iso3Language=eng'
page_count = 442

driver = undetected_chromedriver.Chrome()
driver.get(base_url)
# get all of the review elements on the page
review_elements = driver.find_elements(by=By.XPATH, value="//div[@class='gdReview']")

# loop through each review element and extract the relevant information
reviews = []


def extract_star_rating(reviewElement, category_name):
    # Checking if Rating by Category is available
    try:
        reviewElement.find_element(By.XPATH, ".//aside")
    except NoSuchElementException:
        # An exception here means Rating by Category is not available, so return the total Rating
        print("No Category level Rating Info")
        # The leading '.' keeps the search inside this review element
        rating = int(float(reviewElement.find_element(By.XPATH, ".//span[contains(@class,'ratingNumber')]").text))
        return rating
    xpath = f'.//div[text()="{category_name}"]/following-sibling::div'

    # Processing as Rating by Category is available
    category_div = reviewElement.find_element(By.XPATH, xpath)
    class_name = category_div.get_attribute('class')
    if 'css-1mfncox' in class_name:
        return 1
    elif 'css-1lp3h8x' in class_name:
        return 2
    elif 'css-k58126' in class_name:
        return 3
    elif 'css-94nhxw' in class_name:
        return 4
    else:
        return 5


for element in review_elements:
    review = {}
    try:
        review['Work/Life Balance'] = extract_star_rating(element, 'Work/Life Balance')
    except NoSuchElementException as e:
        review['Work/Life Balance'] = "N/A"

    try:
        review['Culture & Values'] = extract_star_rating(element, 'Culture & Values')
    except NoSuchElementException as e:
        review['Culture & Values'] = "N/A"

    try:
        review['Diversity & Inclusion'] = extract_star_rating(element, 'Diversity and Inclusion')
    except NoSuchElementException as e:
        review['Diversity & Inclusion'] = "N/A"

    try:
        review['Career Opportunities'] = extract_star_rating(element, 'Career Opportunities')
    except NoSuchElementException as e:
        review['Career Opportunities'] = "N/A"

    try:
        review['Compensation and Benefits'] = extract_star_rating(element, 'Compensation and Benefits')
    except NoSuchElementException as e:
        review['Compensation and Benefits'] = "N/A"

    try:
        review['Senior Management'] = extract_star_rating(element, 'Senior Management')
    except NoSuchElementException as e:
        review['Senior Management'] = "N/A"

    reviews.append(review)
for r in reviews:
    print(r)

It extracts all the ratings and prints them:

{'Work/Life Balance': 2, 'Culture & Values': 4, 'Diversity & Inclusion': 4, 'Career Opportunities': 4, 'Compensation and Benefits': 4, 'Senior Management': 4}
{'Work/Life Balance': 3, 'Culture & Values': 3, 'Diversity & Inclusion': 3, 'Career Opportunities': 3, 'Compensation and Benefits': 3, 'Senior Management': 3}
{'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 3, 'Compensation and Benefits': 5, 'Senior Management': 4}
{'Work/Life Balance': 2, 'Culture & Values': 2, 'Diversity & Inclusion': 5, 'Career Opportunities': 4, 'Compensation and Benefits': 1, 'Senior Management': 2}
{'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 5, 'Compensation and Benefits': 3, 'Senior Management': 5}
{'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 5, 'Compensation and Benefits': 4, 'Senior Management': 5}
{'Work/Life Balance': 2, 'Culture & Values': 2, 'Diversity & Inclusion': 2, 'Career Opportunities': 2, 'Compensation and Benefits': 2, 'Senior Management': 2}
{'Work/Life Balance': 3, 'Culture & Values': 3, 'Diversity & Inclusion': 3, 'Career Opportunities': 3, 'Compensation and Benefits': 3, 'Senior Management': 3}
{'Work/Life Balance': 4, 'Culture & Values': 4, 'Diversity & Inclusion': 3, 'Career Opportunities': 3, 'Compensation and Benefits': 4, 'Senior Management': 2}
{'Work/Life Balance': 2, 'Culture & Values': 3, 'Diversity & Inclusion': 4, 'Career Opportunities': 3, 'Compensation and Benefits': 4, 'Senior Management': 2}

In case some categories are missing, we get:

{'Work/Life Balance': 'N/A', 'Culture & Values': 'N/A', 'Diversity & Inclusion': 'N/A', 'Career Opportunities': 1, 'Compensation and Benefits': 'N/A', 'Senior Management': 'N/A'}
{'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 5, 'Compensation and Benefits': 5, 'Senior Management': 5}
{'Work/Life Balance': 1, 'Culture & Values': 1, 'Diversity & Inclusion': 1, 'Career Opportunities': 1, 'Compensation and Benefits': 1, 'Senior Management': 1}

Note: There may still be more rating-related cases that you would need to handle on other pages.
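
To extend the single-page sample to all pages, the page number in the URL has to be swapped for each iteration. The sketch below assumes the `_P<n>.htm` pattern seen in the URLs of this thread and keeps any query string; `page_url` is a hypothetical helper name, and the real site may use a different pattern on other pages.

```python
import re


def page_url(url: str, page: int) -> str:
    """Return url pointing at the given page, replacing any existing _P<n> suffix."""
    head, sep, query = url.partition("?")
    if not head.endswith(".htm"):
        raise ValueError(f"expected a path ending in .htm, got {head!r}")
    # Drop ".htm" and any page suffix already present, then append the new one.
    stem = re.sub(r"_P\d+$", "", head[: -len(".htm")])
    return f"{stem}_P{page}.htm{sep}{query}"


base_url = ("https://www.glassdoor.co.in/Reviews/"
            "Cognizant-Technology-Solutions-Reviews-E8014_P3.htm"
            "?filter.iso3Language=eng")
print(page_url(base_url, 5))
```

In the scraper this would be used as `for page in range(1, page_count + 1): driver.get(page_url(base_url, page))`, with the review-extraction loop moved inside.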
