R Glassdoor Web Scraping
Question
I have been tasked to collect Glassdoor Reviews for different hospitals and I am having difficulties extracting the Pros, Cons, Advice to management, Recommend, CEO Approval, Business Outlook, and the small rating drop down. I have been able to extract the rest from the code below. Any help would be greatly appreciated.
library(rvest)
library(tidyverse)
library(stringr)
url <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm? sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
page <- read_html(url)
# Extract review titles
review_titles <- page %>%
html_nodes(".reviewLink") %>%
html_text()
# Extract review dates
review_dates <- page %>%
html_nodes(".middle.common__EiReviewDetailsStyle__newGrey") %>%
html_text()
#Extract Pros
review_pros <- page %>%
html_nodes("v2__EIReviewDetailsV2__fullWidth ") %>%
html_text()
print(review_pros)
# Extract review ratings
review_ratings <- page %>%
html_nodes(".ratingNumber.mr-xsm") %>%
html_text() %>%
str_extract("\d+") %>%
as.integer()
# Extract review recommendations
recommendations <- page %>%
html_nodes("html body.main.loggedIn.lang-en.en-US.gdGrid._initOk div#Container div.container-max-width.mx-auto.px-0.px-lg-lg.py-lg-xxl div.d-flex.row.css-zwxlu7.e1af7d9i0 main.col-12.mb-lg-0.mb-md.css-yaeagj.ej1dgw00 div#ReviewsRef div#ReviewsFeed ol.empReviews.emp-reviews-feed.pl-0 li#empReview_76309432.noBorder.empReview.cf.pb-0.mb-0 div.p-0.mb-0.mb-md-std.css-w5wad1.gd-ui-module.css-rntt2a.ec4dwm00 div.gdReview div.mt-xxsm div.mx-0 div.px-std div div.d-flex.my-std.reviewBodyCell.recommends.css-1y3jl3a.e1868oi10") %>%
html_text()
# Convert recommendations to numeric values
recommendations_numeric <- ifelse(grepl("css-hcqxoa-svg", recommendations), 1,
ifelse(grepl("css-1y3jl3a-svg", recommendations), -1, 0))
# Create data frame
reviews <- data.frame(Title = review_titles, Rating = review_ratings, Date = review_dates)
# View data frame
reviews
Answer 1
Score: 1
I was able to pull pros and cons as such:
library(tidyverse)
library(rvest)
data <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm?%20%20%20%20%20%20%20%20%20%20%20%20sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng" %>%
read_html() %>%
html_elements(".empReview")
tibble(
title = data %>%
html_element(".reviewLink") %>%
html_text2(),
date = data %>%
html_element(".middle.common__EiReviewDetailsStyle__newGrey") %>%
html_text2(),
pros = data %>%
html_element(".v2__EIReviewDetailsV2__fullWidth:nth-child(1) span") %>%
html_text2(),
cons = data %>%
html_element(".v2__EIReviewDetailsV2__fullWidth+ .v2__EIReviewDetailsV2__fullWidth span") %>%
html_text2() %>%
str_trim()
) %>%
separate(col = date, into = c("date", "position"), sep = " - ")
# A tibble: 10 × 5
title date position pros cons
<chr> <chr> <chr> <chr> <chr>
1 Great place to work Mar 3, 2023 Manager Excellent… "Do …
2 Don't bother May 10, 2023 Practice Manager Being loc… "Hor…
3 Skeleton staffing Nov 16, 2022 Registered Nurse, Emergency Room Co-worker… "No …
4 Nyack hospital Mar 7, 2023 Patient Care Associate (PCA) The food … "No …
5 Its ok Mar 22, 2023 Registered Nurse, BSN one weeke… "sho…
6 pca Jan 18, 2023 Patient Care Assistant (PCA) good pay … "non…
7 Just for starters Feb 3, 2023 Registered Nurse, Critical Care Coworkers… "No …
8 PCA Oct 22, 2022 Emergency Care Assistant there sta… "the…
9 Great way to support the Hospital Sep 16, 2022 Donor Relations Manager Most ever… "Lon…
10 great place to work Sep 5, 2022 Registered Nurse lots of o… "lim…
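The same per-review approach can be extended to the overall star rating, reusing the .ratingNumber.mr-xsm selector that already worked in the question's code. A minimal sketch, assuming that class is still present on the page and reusing the data nodeset built above:
library(tidyverse)
library(rvest)
# reuse the per-review nodes collected above (data <- ... html_elements(".empReview"))
tibble(
  title = data %>%
    html_element(".reviewLink") %>%
    html_text2(),
  # the .ratingNumber.mr-xsm selector comes from the question's code and is
  # assumed to still hold the numeric star rating for each review
  rating = data %>%
    html_element(".ratingNumber.mr-xsm") %>%
    html_text2() %>%
    str_extract("\\d+\\.?\\d*") %>%
    as.numeric()
)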
Answer 2
Score: 0
The data you are looking for is stored in a script on the page. This answer is based on a similar question: https://stackoverflow.com/questions/72835014/web-scraping-data-that-is-not-displayed-on-a-webpage-using-rvest
It took a while of searching and trial and error to get it right. In the script there is a section that starts with "reviews": and ends with }]}; in this case it came after the second occurrence of "reviews". It is a matter of extracting that part and converting it from JSON.
library(stringr)
library(xml2)
library(rvest)
library(dplyr)
url <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
page <- read_html(url)
#the ratings are stored in a data structure in a script
#find all the scripts and then search
scripts <- page %>% html_elements(xpath = '//script')
#search the scripts for the ratings
ratingsScript <- which(grepl("ratingCareerOpportunities", scripts))
#extract the reviews text from that script; this is the second "reviews" section and is almost valid JSON
reviews <- scripts[ratingsScript] %>% html_text2() %>%
  str_extract("\"reviews\":.+?\\}\\]\\}") %>% substring(10) %>% str_extract("\"reviews\":.+?\\}\\]\\}")
nchar(reviews) #debugging status
#add a leading { to make valid JSON and convert
answer <- jsonlite::fromJSON(paste("{", reviews))
answer[ , c("ratingRecommendToFriend", "ratingCeo", "ratingBusinessOutlook")]
The resulting answer data frame contains a lot of potentially useful information: job status, comments, reviewer IDs, star ratings, and so on.
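As a rough illustration, the parsed data frame can be inspected and narrowed down with dplyr. The three rating columns are named in the code above; any other field names (such as pros or cons) are assumptions about the JSON payload and should be checked with names(answer):
library(dplyr)
# check which fields the JSON actually exposes
names(answer)
# keep only the fields of interest; ratingRecommendToFriend, ratingCeo and
# ratingBusinessOutlook are taken from the answer above, while "pros", "cons"
# and "advice" are assumed names that may differ in the real payload
answer %>%
  select(any_of(c("pros", "cons", "advice",
                  "ratingRecommendToFriend", "ratingCeo",
                  "ratingBusinessOutlook")))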