R Glassdoor Web Scraping
Question
I have been tasked to collect Glassdoor Reviews for different hospitals and I am having difficulties extracting the Pros, Cons, Advice to management, Recommend, CEO Approval, Business Outlook, and the small rating drop down. I have been able to extract the rest from the code below. Any help would be greatly appreciated.
library(rvest)
library(tidyverse)
library(stringr)
url <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm? sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
page <- read_html(url)
# Extract review titles
review_titles <- page %>%
html_nodes(".reviewLink") %>%
html_text()
# Extract review dates
review_dates <- page %>%
html_nodes(".middle.common__EiReviewDetailsStyle__newGrey") %>%
html_text()
#Extract Pros
review_pros <- page %>%
html_nodes("v2__EIReviewDetailsV2__fullWidth ") %>%
html_text()
print(review_pros)
# Extract review ratings
review_ratings <- page %>%
html_nodes(".ratingNumber.mr-xsm") %>%
html_text() %>%
str_extract("\d+") %>%
as.integer()
# Extract review recommendations
recommendations <- page %>%
html_nodes("html body.main.loggedIn.lang-en.en-US.gdGrid._initOk div#Container div.container-max-width.mx-auto.px-0.px-lg-lg.py-lg-xxl div.d-flex.row.css-zwxlu7.e1af7d9i0 main.col-12.mb-lg-0.mb-md.css-yaeagj.ej1dgw00 div#ReviewsRef div#ReviewsFeed ol.empReviews.emp-reviews-feed.pl-0 li#empReview_76309432.noBorder.empReview.cf.pb-0.mb-0 div.p-0.mb-0.mb-md-std.css-w5wad1.gd-ui-module.css-rntt2a.ec4dwm00 div.gdReview div.mt-xxsm div.mx-0 div.px-std div div.d-flex.my-std.reviewBodyCell.recommends.css-1y3jl3a.e1868oi10") %>%
html_text()
# Convert recommendations to numeric values
recommendations_numeric <- ifelse(grepl("css-hcqxoa-svg", recommendations), 1,
ifelse(grepl("css-1y3jl3a-svg", recommendations), -1, 0))
# Create data frame
reviews <- data.frame(Title = review_titles, Rating = review_ratings, Date = review_dates)
# View data frame
reviews
Answer 1
Score: 1
I was able to pull pros and cons as such:
library(tidyverse)
library(rvest)
data <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm?%20%20%20%20%20%20%20%20%20%20%20%20sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng" %>%
read_html() %>%
html_elements(".empReview")
tibble(
title = data %>%
html_element(".reviewLink") %>%
html_text2(),
date = data %>%
html_element(".middle.common__EiReviewDetailsStyle__newGrey") %>%
html_text2(),
pros = data %>%
html_element(".v2__EIReviewDetailsV2__fullWidth:nth-child(1) span") %>%
html_text2(),
cons = data %>%
html_element(".v2__EIReviewDetailsV2__fullWidth+ .v2__EIReviewDetailsV2__fullWidth span") %>%
html_text2() %>%
str_trim()
) %>%
separate(col = date, into = c("date", "position"), sep = " - ")
# A tibble: 10 × 5
title date position pros cons
<chr> <chr> <chr> <chr> <chr>
1 Great place to work Mar 3, 2023 Manager Excellent… "Do …
2 Don't bother May 10, 2023 Practice Manager Being loc… "Hor…
3 Skeleton staffing Nov 16, 2022 Registered Nurse, Emergency Room Co-worker… "No …
4 Nyack hospital Mar 7, 2023 Patient Care Associate (PCA) The food … "No …
5 Its ok Mar 22, 2023 Registered Nurse, BSN one weeke… "sho…
6 pca Jan 18, 2023 Patient Care Assistant (PCA) good pay … "non…
7 Just for starters Feb 3, 2023 Registered Nurse, Critical Care Coworkers… "No …
8 PCA Oct 22, 2022 Emergency Care Assistant there sta… "the…
9 Great way to support the Hospital Sep 16, 2022 Donor Relations Manager Most ever… "Lon…
10 great place to work Sep 5, 2022 Registered Nurse lots of o… "lim…
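The same per-review approach can be extended to the overall star rating, reusing the .ratingNumber.mr-xsm selector that already worked in the question's code. A minimal sketch, assuming that class is still present on the page and reusing the data nodeset built above:
library(tidyverse)
library(rvest)
# reuse the per-review nodes collected above (data <- ... html_elements(".empReview"))
tibble(
  title = data %>%
    html_element(".reviewLink") %>%
    html_text2(),
  # the .ratingNumber.mr-xsm selector comes from the question's code and is
  # assumed to still hold the numeric star rating for each review
  rating = data %>%
    html_element(".ratingNumber.mr-xsm") %>%
    html_text2() %>%
    str_extract("\\d+\\.?\\d*") %>%
    as.numeric()
)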
Answer 2
Score: 0
The data you are looking for is stored in a script on the page. This answer is based on a similar question: https://stackoverflow.com/questions/72835014/web-scraping-data-that-is-not-displayed-on-a-webpage-using-rvest
It took a while of searching and trial and error to get it right. In the script there is a section that starts with "reviews": and ends with }]}; in this case it came after the second occurrence of "reviews". It is a matter of extracting that part and converting it from JSON.
library(stringr)
library(xml2)
library(rvest)
library(dplyr)
url <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
page <- read_html(url)
#the ratings are stored in a data structure in a script
#find all the scripts and then search
scripts <- page %>% html_elements(xpath = '//script')
#search the scripts for the ratings
ratingsScript <- which(grepl("ratingCareerOpportunities", scripts))
#extract the reviews text from that script; this is the second "reviews" section and is almost valid JSON
reviews <- scripts[ratingsScript] %>% html_text2() %>%
  str_extract("\"reviews\":.+?\\}\\]\\}") %>% substring(10) %>% str_extract("\"reviews\":.+?\\}\\]\\}")
nchar(reviews) #debugging status
#add a leading { to make valid JSON and convert
answer <- jsonlite::fromJSON(paste("{", reviews))
answer[ , c("ratingRecommendToFriend", "ratingCeo", "ratingBusinessOutlook")]
The resulting answer data frame contains a lot of potentially useful information: job status, comments, reviewer IDs, star ratings, and so on.
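As a rough illustration, the parsed data frame can be inspected and narrowed down with dplyr. The three rating columns are named in the code above; any other field names (such as pros or cons) are assumptions about the JSON payload and should be checked with names(answer):
library(dplyr)
# check which fields the JSON actually exposes
names(answer)
# keep only the fields of interest; ratingRecommendToFriend, ratingCeo and
# ratingBusinessOutlook are taken from the answer above, while "pros", "cons"
# and "advice" are assumed names that may differ in the real payload
answer %>%
  select(any_of(c("pros", "cons", "advice",
                  "ratingRecommendToFriend", "ratingCeo",
                  "ratingBusinessOutlook")))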