scraping with rvest got error no applicable method for 'xml_find_first' applied to an object of class "character"

# Question




I am trying to scrape a page on booking.com with rvest. The problem is that I need the code to return NA when, for example, a hotel does not have ratings, so that the data frame has the same number of rows for each parameter I am trying to scrape.

The code I am using, which runs fine but does not return NA, is this:

```r
# Necessary packages
library(rvest)
library(dplyr)
library(httr)

# Base URL of the search results page
base_url <- "https://www.booking.com/searchresults.it.html"

# Parameters we add to the search to get the specific results
params <- list(
  ss = "Firenze%2C+Toscana%2C+Italia",
  efdco = 1,
  label = "booking-name-L*Xf2U1sq4*GEkIwcLOALQS267777916051%3Apl%3Ata%3Ap1%3Ap22%2C563%2C000%3Aac%3Aap%3Aneg%3Afi%3Atikwd-65526620%3Alp9069992%3Ali%3Adec%3Adm%3Appccp",
  aid = 376363,
  lang = "it",
  sb = 1,
  src_elem = "sb",
  src = "index",
  dest_id = -117543,
  dest_type = "city",
  ac_position = 0,
  ac_click_type = "b",
  ac_langcode = "it",
  ac_suggestion_list_length = 5,
  search_selected = "true",
  search_pageview_id = "2e375b14ad810329",
  ac_meta = "GhAyZTM3NWIxNGFkODEwMzI5IAAoATICaXQ6BGZpcmVAAEoAUAA%3D",
  checkin = "2023-06-11",
  checkout = "2023-06-18",
  group_adults = 2,
  no_rooms = 1,
  group_children = 0,
  sb_travel_purpose = "leisure"
)

# Create empty vectors to store the titles, ratings, prices
titles <- c()
ratings <- c()
prices <- c()

### Loop through each page of the search results
for (page_num in 1:35) {
  # Build the URL for the current page
  url <- modify_url(base_url, query = c(params, page = page_num))
  # Read the HTML of the specified page
  page <- read_html(url)
  # Extract the titles, ratings, prices from the current page
  # (selectors taken from the page's Inspect view)
  titles_page <- page %>% html_elements("div[data-testid='title']") %>% html_text()
  prices_page <- titles_page %>% html_element("span[data-testid='price-and-discounted-price']") %>% html_text()
  ratings_page <- titles_page %>% html_element("div[aria-label^='Punteggio di']") %>% html_text()
  # Append the titles, ratings, prices from the current page to the vectors
  titles <- c(titles, titles_page)
  prices <- c(prices, prices_page)
  ratings <- c(ratings, ratings_page)
}

hotel <- data.frame(titles, prices, ratings)
print(hotel)
```

I have seen it suggested to add parent and child nodes, and I have tried this, but it does not work:

```r
titles_page <- page %>% html_elements("div[data-testid='title']") %>% html_text()
prices_page <- titles_page %>% html_element("span[data-testid='price-and-discounted-price']") %>% html_text()
ratings_page <- titles_page %>% html_element("div[aria-label^='Punteggio di']") %>% html_text()
```
# Answer 1

**Score**: 2

`titles_page <- page %>% html_elements("div[data-testid='title']") %>% html_text()` is creating a vector of character strings.

You cannot parse `titles_page` in the next line of code.

You are skipping the step of creating a vector of parent nodes. Review your previous question/answer https://stackoverflow.com/questions/76029792/how-to-report-na-when-scraping-a-web-with-r-and-it-does-not-have-value and look at the line `properties <- html_elements(page, xpath=".//div[@data-testid='property-card']")` in the answer. It returns a vector of XML nodes. Now parse this vector of nodes to obtain the desired information.
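The dispatch difference can be sketched on a toy document (the `card`, `title`, and `price` class names below are made up for illustration, not booking.com's real markup):

```r
library(rvest)  # also provides %>%

# Toy markup: the second card has no <span class="price">,
# mimicking a hotel with no rating on the results page.
doc <- read_html('
  <div class="card"><div class="title">Hotel A</div><span class="price">100</span></div>
  <div class="card"><div class="title">Hotel B</div></div>
')

# Step 1: the parent nodes -- an xml_nodeset, one node per card
cards <- html_elements(doc, "div.card")

# Step 2: html_element() on a nodeset returns exactly one result per
# parent, with NA where the child is missing, so the vectors stay aligned.
titles <- cards %>% html_element("div.title") %>% html_text()
prices <- cards %>% html_element("span.price") %>% html_text()
# titles: "Hotel A" "Hotel B"
# prices: "100"     NA

# Piping the *text* (a character vector) into html_element() instead
# reproduces the question's error, because xml_find_first() has no
# method for class "character":
# titles %>% html_element("span.price")  # -> error
```

Because `html_element()` pads missing children with NA rather than dropping them, the three vectors keep the same length and `data.frame()` succeeds at the end.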

The error came from not having these lines correct:

```r
# Find the parents
properties <- html_elements(page, xpath=".//div[@data-testid='property-card']")
# Get the information from each parent
titles_page <- properties %>% html_element("div[data-testid='title']") %>% html_text()
prices_page <- properties %>% html_element("span[data-testid='price-and-discounted-price']") %>% html_text()
ratings_page <- properties %>% html_element("div[aria-label^='Punteggio di']") %>% html_text()
```

The full corrected loop is now:

```r
for (page_num in 1:35) {
  # Build the URL for the current page
  url <- modify_url(base_url, query = c(params, page = page_num))
  # Read the HTML of the specified page
  page <- read_html(url)
  # Parse out the parent node for each property
  properties <- html_elements(page, xpath=".//div[@data-testid='property-card']")
  # Now find the information within each parent
  titles_page <- properties %>% html_element("div[data-testid='title']") %>% html_text()
  prices_page <- properties %>% html_element("span[data-testid='price-and-discounted-price']") %>% html_text()
  ratings_page <- properties %>% html_element("div[aria-label^='Punteggio di']") %>% html_text()
  # Append the titles, ratings, prices from the current page to the vectors
  titles <- c(titles, titles_page)
  prices <- c(prices, prices_page)
  ratings <- c(ratings, ratings_page)
}
```
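One fragility worth noting: if any one of the 35 requests fails (timeout, rate limiting), `read_html()` raises an error and aborts the whole loop, losing everything collected so far. A minimal sketch of one way around this, wrapping the download in base R's `tryCatch()` (the helper name `read_page_safely` is made up for illustration):

```r
library(rvest)

# Hypothetical helper: fetch a page, returning NULL instead of raising
# an error, so a scraping loop can skip failed pages with `next`.
read_page_safely <- function(url) {
  tryCatch(read_html(url), error = function(e) NULL)
}

# Inside the loop, this would replace `page <- read_html(url)`:
#   page <- read_page_safely(url)
#   if (is.null(page)) next   # skip this page, keep what we have
#   Sys.sleep(1)              # courtesy delay between requests
```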

huangapple
  • Published on 2023-04-20 05:27:35
  • When reposting, please keep the source link: https://go.coder-hub.com/76058936.html