R Glassdoor Web Scraping

Question

I have been tasked with collecting Glassdoor reviews for different hospitals, and I am having difficulty extracting the Pros, Cons, Advice to Management, Recommend, CEO Approval, and Business Outlook fields, as well as the small rating drop-down. I have been able to extract the rest with the code below. Any help would be greatly appreciated.

library(rvest)
library(tidyverse)
library(stringr)

url <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
page <- read_html(url)

# Extract review titles

review_titles <- page %>%
html_nodes(".reviewLink") %>%
html_text()

# Extract review dates

review_dates <- page %>%
html_nodes(".middle.common__EiReviewDetailsStyle__newGrey") %>%
html_text()

# Extract Pros
review_pros <- page %>%
html_nodes(".v2__EIReviewDetailsV2__fullWidth") %>%
html_text()
print(review_pros)

# Extract review ratings

review_ratings <- page %>%
html_nodes(".ratingNumber.mr-xsm") %>%
html_text() %>%
str_extract("\\d+") %>%
as.integer()

# Extract review recommendations

# Match the "recommends" cell of every review, not only the one with a hard-coded id
recommendations <- page %>%
html_nodes("#ReviewsFeed li.empReview div.reviewBodyCell.recommends") %>%
html_text()

# Convert recommendations to numeric values

recommendations_numeric <- ifelse(grepl("css-hcqxoa-svg", recommendations), 1,
ifelse(grepl("css-1y3jl3a-svg", recommendations), -1, 0))

# Create data frame

reviews <- data.frame(Title = review_titles, Rating = review_ratings, Date = review_dates)

# View data frame

reviews

Answer 1

Score: 1

I was able to pull pros and cons as such:

library(tidyverse)
library(rvest)

# Select one node per review so that the fields below stay aligned row by row
data <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng" %>%
  read_html() %>%
  html_elements(".empReview")

tibble(
  title = data %>%
    html_element(".reviewLink") %>%
    html_text2(),
  date = data %>%
    html_element(".middle.common__EiReviewDetailsStyle__newGrey") %>%
    html_text2(),
  pros = data %>%
    html_element(".v2__EIReviewDetailsV2__fullWidth:nth-child(1) span") %>%
    html_text2(),
  cons = data %>%
    html_element(".v2__EIReviewDetailsV2__fullWidth+ .v2__EIReviewDetailsV2__fullWidth span") %>%
    html_text2() %>%
    str_trim()
) %>%
  # The date field reads "<date> - <position>", so split it into two columns
  separate(col = date, into = c("date", "position"), sep = " - ")

# A tibble: 10 × 5
   title                             date         position                         pros       cons 
   <chr>                             <chr>        <chr>                            <chr>      <chr>
 1 Great place to work               Mar 3, 2023  Manager                          Excellent… "Do …
 2 Don't bother                      May 10, 2023 Practice Manager                 Being loc… "Hor…
 3 Skeleton staffing                 Nov 16, 2022 Registered Nurse, Emergency Room Co-worker… "No …
 4 Nyack hospital                    Mar 7, 2023  Patient Care Associate (PCA)     The food … "No …
 5 Its ok                            Mar 22, 2023 Registered Nurse, BSN            one weeke… "sho…
 6 pca                               Jan 18, 2023 Patient Care Assistant (PCA)     good pay … "non…
 7 Just for starters                 Feb 3, 2023  Registered Nurse, Critical Care  Coworkers… "No …
 8 PCA                               Oct 22, 2022 Emergency Care Assistant         there sta… "the…
 9 Great way to support the Hospital Sep 16, 2022 Donor Relations Manager          Most ever… "Lon…
10 great place to work               Sep 5, 2022  Registered Nurse                 lots of o… "lim…
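
The key step in the answer above is selecting one node per review with html_elements() and then calling html_element() (singular) inside each node, which yields exactly one value per review and NA where a field is missing. Here is a minimal offline sketch of that pattern, assuming a made-up HTML fragment; the class names review, title, pros and cons are hypothetical and are not Glassdoor's.

```r
library(rvest)

# Hypothetical markup: the second review has no "cons" element
html <- minimal_html('
  <ol>
    <li class="review"><a class="title">Great place</a>
      <span class="pros">Good pay</span><span class="cons">Long hours</span></li>
    <li class="review"><a class="title">Its ok</a>
      <span class="pros">Nice coworkers</span></li>
  </ol>')

# One node per review
reviews <- html_elements(html, ".review")

# html_element() returns one result per review and NA when the field is absent,
# so all columns have the same length and the rows stay aligned
result <- data.frame(
  title = reviews %>% html_element(".title") %>% html_text2(),
  pros  = reviews %>% html_element(".pros")  %>% html_text2(),
  cons  = reviews %>% html_element(".cons")  %>% html_text2()
)
result
```

Running html_nodes() directly on the whole page, as in the question, would instead return only the matches that exist, so a review with a missing field would shorten that column and data.frame() would fail or misalign.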

Answer 2

Score: 0

The data you are looking for is stored in a script. This answer is based on a similar question: https://stackoverflow.com/questions/72835014/web-scraping-data-that-is-not-displayed-on-a-webpage-using-rvest

It took some searching and trial and error to get it right. In the script there is a section that starts with "reviews": and ends with }]}. In this case it came after the second occurrence of "reviews". It is then a matter of extracting that part and converting it from JSON.

library(stringr)
library(xml2)
library(rvest)
library(dplyr)

url <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
page <- read_html(url)

# The ratings are stored in a data structure inside a script,
# so find all the scripts and then search them
scripts <- page %>% html_elements(xpath = '//script')

# Search the scripts for the ratings
ratingsScript <- which(grepl("ratingCareerOpportunities", scripts))

# Extract the text of the reviews from the script. This is the second "reviews"
# section, and it is almost valid JSON
reviews <- scripts[ratingsScript] %>% html_text2() %>%
  str_extract("\"reviews\":.+?\\}\\]\\}") %>% substring(10) %>% str_extract("\"reviews\":.+?\\}\\]\\}")
nchar(reviews)  # debugging status

# Add a leading { to make valid JSON and convert
answer <- jsonlite::fromJSON(paste("{", reviews))
# The parsed object holds the data frame under its "reviews" key,
# and the column names must be quoted
answer$reviews[, c("ratingRecommendToFriend", "ratingCeo", "ratingBusinessOutlook")]

There is a lot of potentially useful information in the answer data frame. Job status, comments, reviewers id, star reviews, etc.
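
The extract-and-parse step can be tried offline before running it against the live page. The following is only a sketch on an invented string: the script text, the ratingOverall field, and all field values are assumptions made for illustration, not what Glassdoor actually returns.

```r
library(stringr)
library(jsonlite)

# Hypothetical script text; the values are invented for illustration
script_text <- 'window.state = {"reviews":[{"id":1,"ratingCeo":"APPROVE","ratingOverall":5},{"id":2,"ratingCeo":"DISAPPROVE","ratingOverall":2}]}, "other": 1};'

# Lazily match from "reviews": up to the first }]} that follows it
fragment <- str_extract(script_text, '"reviews":.+?\\}\\]\\}')

# Prepend { so the fragment becomes a complete JSON object, then parse it
parsed <- fromJSON(paste0("{", fragment))

# The array of review objects is simplified to a data frame under the "reviews" key
parsed$reviews[, c("id", "ratingCeo", "ratingOverall")]
```

Because the match is lazy (.+?), it stops at the first }]} after "reviews":, which is why the real script may need a second pass to reach the section that holds the actual review records.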

huangapple
  • Posted on May 14, 2023, 03:46:37
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76244599.html