Scraping with rvest: error "no applicable method for 'xml_find_first' applied to an object of class \"character\""

# Question
I am trying to scrape a page on booking.com with rvest. The problem is that I need the code to return NA when a hotel does not have ratings, for example, so that the data frame has the same number of rows for each parameter I'm trying to scrape.

The code I am using, which runs but does not return NA, is this:
```r
# Necessary packages
library(rvest)
library(dplyr)
library(httr)

# Base URL of the search results page
base_url <- "https://www.booking.com/searchresults.it.html"

# Parameters we add to the search to get the specific results
params <- list(
  ss = "Firenze%2C+Toscana%2C+Italia",
  efdco = 1,
  label = "booking-name-L*Xf2U1sq4*GEkIwcLOALQS267777916051%3Apl%3Ata%3Ap1%3Ap22%2C563%2C000%3Aac%3Aap%3Aneg%3Afi%3Atikwd-65526620%3Alp9069992%3Ali%3Adec%3Adm%3Appccp",
  aid = 376363,
  lang = "it",
  sb = 1,
  src_elem = "sb",
  src = "index",
  dest_id = -117543,
  dest_type = "city",
  ac_position = 0,
  ac_click_type = "b",
  ac_langcode = "it",
  ac_suggestion_list_length = 5,
  search_selected = "true",
  search_pageview_id = "2e375b14ad810329",
  ac_meta = "GhAyZTM3NWIxNGFkODEwMzI5IAAoATICaXQ6BGZpcmVAAEoAUAA%3D",
  checkin = "2023-06-11",
  checkout = "2023-06-18",
  group_adults = 2,
  no_rooms = 1,
  group_children = 0,
  sb_travel_purpose = "leisure"
)

# Create empty vectors to store the titles, ratings, prices
titles <- c()
ratings <- c()
prices <- c()

### Loop through each page of the search results
for (page_num in 1:35) {
  # Build the URL for the current page
  url <- modify_url(base_url, query = c(params, page = page_num))
  # Read the HTML of the specified page
  page <- read_html(url)
  # Extract the titles, ratings, prices from the current page
  # (selectors taken from Inspect Element on the page)
  titles_page <- page %>% html_elements("div[data-testid='title']") %>% html_text()
  prices_page <- titles_page %>% html_element("span[data-testid='price-and-discounted-price']") %>% html_text()
  ratings_page <- titles_page %>% html_element("div[aria-label^='Punteggio di']") %>% html_text()
  # Append the titles, ratings, prices from the current page to the vectors
  titles <- c(titles, titles_page)
  prices <- c(prices, prices_page)
  ratings <- c(ratings, ratings_page)
}

hotel <- data.frame(titles, prices, ratings)
print(hotel)
```

I have seen it suggested to add parent and child nodes, and I have tried this, but it does not work:

```r
titles_page <- page %>% html_elements("div[data-testid='title']") %>% html_text()
prices_page <- titles_page %>% html_element("span[data-testid='price-and-discounted-price']") %>% html_text()
ratings_page <- titles_page %>% html_element("div[aria-label^='Punteggio di']") %>% html_text()
```
# Answer 1

**Score**: 2
`titles_page <- page %>% html_elements("div[data-testid='title']") %>% html_text()` creates a vector of character strings.

You cannot parse `titles_page` in the next line of code; that is why you get "no applicable method for 'xml_find_first' applied to an object of class \"character\"".

You are skipping the step of creating a vector of parent nodes. Review your previous question/answer https://stackoverflow.com/questions/76029792/how-to-report-na-when-scraping-a-web-with-r-and-it-does-not-have-value and look at the line `properties <- html_elements(page, xpath=".//div[@data-testid='property-card']")` in the answer. It returns a vector of XML nodes. Now parse this vector of nodes to obtain the desired information.
The error was not having these lines correct:

```r
# find the parent nodes
properties <- html_elements(page, xpath=".//div[@data-testid='property-card']")

# get the information from each parent
titles_page <- properties %>% html_element("div[data-testid='title']") %>% html_text()
prices_page <- properties %>% html_element("span[data-testid='price-and-discounted-price']") %>% html_text()
ratings_page <- properties %>% html_element("div[aria-label^='Punteggio di']") %>% html_text()
```
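As a self-contained illustration (a sketch using made-up hotel names and rvest's `minimal_html()` helper, not Booking.com's real markup), `html_element()` applied to a set of parent nodes returns exactly one result per parent, filling in `NA` where a child is missing:

```r
library(rvest)

# Toy document: two property cards, the second one has no price element
doc <- minimal_html('
  <div data-testid="property-card">
    <div data-testid="title">Hotel A</div>
    <span data-testid="price-and-discounted-price">100</span>
  </div>
  <div data-testid="property-card">
    <div data-testid="title">Hotel B</div>
  </div>')

# One parent node per property card
properties <- html_elements(doc, "div[data-testid='property-card']")

# html_element() keeps the output aligned with `properties`:
# a missing child becomes NA instead of being silently dropped
titles <- properties %>% html_element("div[data-testid='title']") %>% html_text()
prices <- properties %>% html_element("span[data-testid='price-and-discounted-price']") %>% html_text()

data.frame(titles, prices)
```

The second row's price is `NA`, so the two columns stay the same length.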
The full corrected loop is now:

```r
for (page_num in 1:35) {
  # Build the URL for the current page
  url <- modify_url(base_url, query = c(params, page = page_num))
  # Read the HTML of the specified page
  page <- read_html(url)
  # Parse out the parent node for each property
  properties <- html_elements(page, xpath=".//div[@data-testid='property-card']")
  # Now find the information within each parent
  titles_page <- properties %>% html_element("div[data-testid='title']") %>% html_text()
  prices_page <- properties %>% html_element("span[data-testid='price-and-discounted-price']") %>% html_text()
  ratings_page <- properties %>% html_element("div[aria-label^='Punteggio di']") %>% html_text()
  # Append the titles, ratings, prices from the current page to the vectors
  titles <- c(titles, titles_page)
  prices <- c(prices, prices_page)
  ratings <- c(ratings, ratings_page)
}
```
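Because `html_element()` yields one row per parent node, each page can also be collected as its own data frame and the pages stacked afterwards, which makes misaligned columns impossible. A sketch of that variant (the helper name `parse_page()` is made up, and the demo below runs on toy markup rather than the live site; with the real site you would call it as `parse_page(read_html(url))` inside the loop):

```r
library(rvest)
library(dplyr)

# Hypothetical helper: turn one already-parsed results page into a data frame,
# one row per property card, NA where a child element is missing
parse_page <- function(page) {
  properties <- html_elements(page, "div[data-testid='property-card']")
  data.frame(
    titles  = properties %>% html_element("div[data-testid='title']") %>% html_text(),
    prices  = properties %>% html_element("span[data-testid='price-and-discounted-price']") %>% html_text(),
    ratings = properties %>% html_element("div[aria-label^='Punteggio di']") %>% html_text()
  )
}

# Offline demo on two fake "pages"; card 1 has no rating, card 2 has no price
page1 <- minimal_html('<div data-testid="property-card">
  <div data-testid="title">Hotel A</div>
  <span data-testid="price-and-discounted-price">100</span></div>')
page2 <- minimal_html('<div data-testid="property-card">
  <div data-testid="title">Hotel B</div>
  <div aria-label="Punteggio di 8.5">8,5</div></div>')

# Stack the per-page data frames; rows can never drift out of alignment
hotel <- bind_rows(lapply(list(page1, page2), parse_page))
```

The same shape works for the real loop: build each page's data frame inside `lapply(1:35, ...)` and `bind_rows()` the result, instead of growing three separate vectors with `c()`.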