R Glassdoor Web Scraping

Question

I have been tasked with collecting Glassdoor reviews for different hospitals, and I am having difficulty extracting the Pros, Cons, Advice to Management, Recommend, CEO Approval, and Business Outlook fields, as well as the small rating drop-down. I have been able to extract the rest with the code below. Any help would be greatly appreciated.

    library(rvest)
    library(tidyverse)
    library(stringr)

    url <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
    page <- read_html(url)

    # Extract review titles
    review_titles <- page %>%
      html_nodes(".reviewLink") %>%
      html_text()

    # Extract review dates
    review_dates <- page %>%
      html_nodes(".middle.common__EiReviewDetailsStyle__newGrey") %>%
      html_text()

    # Extract Pros (this attempt matches nothing: without a leading "."
    # the selector is read as a tag name, not a class)
    review_pros <- page %>%
      html_nodes("v2__EIReviewDetailsV2__fullWidth ") %>%
      html_text()
    print(review_pros)

    # Extract review ratings
    review_ratings <- page %>%
      html_nodes(".ratingNumber.mr-xsm") %>%
      html_text() %>%
      str_extract("\\d+") %>%
      as.integer()

    # Extract review recommendations
    recommendations <- page %>%
      html_nodes("html body.main.loggedIn.lang-en.en-US.gdGrid._initOk div#Container div.container-max-width.mx-auto.px-0.px-lg-lg.py-lg-xxl div.d-flex.row.css-zwxlu7.e1af7d9i0 main.col-12.mb-lg-0.mb-md.css-yaeagj.ej1dgw00 div#ReviewsRef div#ReviewsFeed ol.empReviews.emp-reviews-feed.pl-0 li#empReview_76309432.noBorder.empReview.cf.pb-0.mb-0 div.p-0.mb-0.mb-md-std.css-w5wad1.gd-ui-module.css-rntt2a.ec4dwm00 div.gdReview div.mt-xxsm div.mx-0 div.px-std div div.d-flex.my-std.reviewBodyCell.recommends.css-1y3jl3a.e1868oi10") %>%
      html_text()

    # Convert recommendations to numeric values
    recommendations_numeric <- ifelse(grepl("css-hcqxoa-svg", recommendations), 1,
      ifelse(grepl("css-1y3jl3a-svg", recommendations), -1, 0))

    # Create data frame
    reviews <- data.frame(Title = review_titles, Rating = review_ratings, Date = review_dates)

    # View data frame
    reviews

Answer 1

Score: 1

I was able to pull the pros and cons like this:

    library(tidyverse)
    library(rvest)

    # Select one node per review, then look up each field inside that node
    data <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng" %>%
      read_html() %>%
      html_elements(".empReview")

    tibble(
      title = data %>%
        html_element(".reviewLink") %>%
        html_text2(),
      date = data %>%
        html_element(".middle.common__EiReviewDetailsStyle__newGrey") %>%
        html_text2(),
      pros = data %>%
        html_element(".v2__EIReviewDetailsV2__fullWidth:nth-child(1) span") %>%
        html_text2(),
      cons = data %>%
        html_element(".v2__EIReviewDetailsV2__fullWidth+ .v2__EIReviewDetailsV2__fullWidth span") %>%
        html_text2() %>%
        str_trim()
    ) %>%
      # The date text also carries the job title, e.g. "Mar 3, 2023 - Manager"
      separate(col = date, into = c("date", "position"), sep = " - ")

    # A tibble: 10 × 5
       title                             date         position                         pros       cons
       <chr>                             <chr>        <chr>                            <chr>      <chr>
     1 Great place to work               Mar 3, 2023  Manager                          Excellent  "Do …
     2 Don't bother                      May 10, 2023 Practice Manager                 Being loc… "Hor
     3 Skeleton staffing                 Nov 16, 2022 Registered Nurse, Emergency Room Co-worker  "No …
     4 Nyack hospital                    Mar 7, 2023  Patient Care Associate (PCA)     The food … "No
     5 Its ok                            Mar 22, 2023 Registered Nurse, BSN            one weeke  "sho…
     6 pca                               Jan 18, 2023 Patient Care Assistant (PCA)     good pay … "non
     7 Just for starters                 Feb 3, 2023  Registered Nurse, Critical Care  Coworkers  "No …
     8 PCA                               Oct 22, 2022 Emergency Care Assistant         there sta… "the
     9 Great way to support the Hospital Sep 16, 2022 Donor Relations Manager          Most ever  "Lon…
    10 great place to work               Sep 5, 2022  Registered Nurse                 lots of o… "lim
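
The approach in this answer looks up each field inside its own review node. A minimal offline sketch of why that keeps the columns aligned is below. The HTML snippet and the `.pros` / `.cons` class names are invented for illustration and are not Glassdoor's real markup; only the rvest functions are real.

```r
library(rvest)

# Made-up markup: two review containers; the second has no "cons" element.
html <- minimal_html('
  <ol>
    <li class="empReview">
      <a class="reviewLink">Great place to work</a>
      <span class="pros">Good pay</span>
      <span class="cons">Long hours</span>
    </li>
    <li class="empReview">
      <a class="reviewLink">Its ok</a>
      <span class="pros">Nice coworkers</span>
    </li>
  </ol>')

reviews <- html_elements(html, ".empReview")

# html_element() (singular) gives one result per container, NA when the
# field is absent, so every column has exactly one entry per review.
per_review <- data.frame(
  title = html_text2(html_element(reviews, ".reviewLink")),
  pros  = html_text2(html_element(reviews, ".pros")),
  cons  = html_text2(html_element(reviews, ".cons"))
)

# A page-wide html_elements() call skips the missing value instead,
# so the result no longer lines up with the titles.
page_wide_cons <- html_text2(html_elements(html, ".cons"))

print(per_review)
print(length(page_wide_cons))
```

If the page-wide vectors have different lengths, `data.frame()` either fails or pairs values with the wrong review, which is why the per-review lookup is usually the safer choice.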

Answer 2

Score: 0

The data you are looking for is stored in a script on the page. This answer is based on a similar question: https://stackoverflow.com/questions/72835014/web-scraping-data-that-is-not-displayed-on-a-webpage-using-rvest

It took some searching and trial and error to get this right. The script contains a section that starts with "reviews": and ends with }]}. In this case it came after the second occurrence of "reviews". The task is then to extract that section and parse it as JSON.

    library(stringr)
    library(xml2)
    library(rvest)
    library(dplyr)

    url <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
    page <- read_html(url)

    # The ratings are stored in a data structure inside a script.
    # Find all the scripts, then search them for the ratings.
    scripts <- page %>% html_elements(xpath = "//script")
    ratingsScript <- which(grepl("ratingCareerOpportunities", scripts))

    # Extract the text of the second "reviews" section from the script.
    # This is almost valid JSON.
    reviews <- scripts[ratingsScript] %>%
      html_text2() %>%
      str_extract("\"reviews\":.+?\\}\\]\\}") %>%
      substring(10) %>%
      str_extract("\"reviews\":.+?\\}\\]\\}")
    nchar(reviews)  # debugging check

    # Add a leading { to make valid JSON and convert
    answer <- jsonlite::fromJSON(paste("{", reviews))

    # Column names must be quoted; fromJSON() returns a list here,
    # with the parsed data frame under the "reviews" key
    answer$reviews[, c("ratingRecommendToFriend", "ratingCeo", "ratingBusinessOutlook")]

The answer data frame contains a lot of potentially useful information: job status, comments, reviewer IDs, star ratings, etc.
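
The extraction step described in this answer can be tried offline on a small string. The sketch below uses a made-up script text, and the field values (such as "APPROVE") and the 1 / -1 / 0 recoding are assumptions for illustration, not values confirmed from Glassdoor. It shows how the two `str_extract()` calls skip the first "reviews" key and how the result becomes a data frame.

```r
library(stringr)
library(jsonlite)

# Made-up script text: a first "reviews" key to skip, then the array we want.
script_text <- paste0(
  'window.appCache={"reviews":{"count":2},"employerReviews":{',
  '"reviews":[{"reviewId":1,"ratingCeo":"APPROVE","pros":"Good pay"},',
  '{"reviewId":2,"ratingCeo":null,"pros":"Nice coworkers"}]},"other":1};'
)

pattern <- '"reviews":.+?\\}\\]\\}'

fragment <- script_text %>%
  str_extract(pattern) %>%   # from the first "reviews": to the first }]}
  substring(10) %>%          # drop the leading "reviews" (9 characters)
  str_extract(pattern)       # now starts at the second "reviews":

# A leading { turns the fragment into a complete JSON object.
parsed <- fromJSON(paste0("{", fragment))
reviews_df <- parsed$reviews

# Hypothetical recoding to numbers, in the spirit of the question:
# 1 = approve, -1 = any other answer, 0 = no answer given.
ceo_numeric <- ifelse(is.na(reviews_df$ratingCeo), 0,
                      ifelse(reviews_df$ratingCeo == "APPROVE", 1, -1))

print(reviews_df)
print(ceo_numeric)
```

The lazy `.+?` stops at the first `}]}`, so this only works while no nested array of objects closes earlier inside the section; on a real page the result is worth checking with `nchar()` as the answer does.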

huangapple
  • Published on May 14, 2023, 03:46:37
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76244599.html