How to report NA when scraping a web page with R and an element has no value?
Question
I am scraping a page on booking.com, and while building the data frame I noticed that not all hotels have ratings.
I tried this, for example:
# Got the elements from Inspect code of the page
titles_page <- page %>% html_elements("div[data-testid='title'][class='fcab3ed991 a23c043802']") %>% html_text()
prices_page <- page %>% html_elements("span[data-testid='price-and-discounted-price']") %>% html_text()
ratings_page <- page %>% html_elements("div[aria-label^='Punteggio di']") %>% html_text()
# The variable ratings
tryCatch(expr ={
ratings_page <- remDr$findElements(using = "xpath", value = "div[aria-label^='Punteggio di']")$getElementAttribute('value')
},
#If the information does not exist in this way it writes NA to the ratings element
error = function(e){
ratings_page <-NA
})
But it does not change anything.
How can I report NA where an element has no value?
Answer 1
Score: 0
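Maybe something like the following (untested). Assign the result of tryCatch() itself: an assignment inside the error handler only creates a local variable in the handler function, so ratings_page never changes.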
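# The variable ratings
ratings_page <- tryCatch(
  expr = {
    # The selector is CSS, so use "css selector"; findElement() (singular)
    # raises an error when nothing matches, which triggers the handler below
    elem <- remDr$findElement(using = "css selector", value = "div[aria-label^='Punteggio di']")
    # Assumption: the rating is the element's text, as html_text() reads it in the question
    elem$getElementText()[[1]]
  },
  # If the element does not exist, the handler's return value (NA) is assigned to ratings_page
  error = function(e) NA
)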
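To see why the original attempt left the variable untouched, here is a minimal base R sketch that needs no browser session. The helper get_rating() is hypothetical and only stands in for a lookup that raises an error when no rating is found.

```r
# Hypothetical stand-in for a lookup that fails when the rating is absent
get_rating <- function(x) {
  if (is.na(x)) stop("no rating element found")
  x
}

# Pattern from the question: the handler assigns to a variable that is
# local to the handler function, so the outer rating_a is never touched
rating_a <- "unchanged"
tryCatch(
  expr = {
    rating_a <- get_rating(NA)
  },
  error = function(e) {
    rating_a <- NA
  }
)

# Pattern from this answer: the handler's return value becomes the value
# of tryCatch(), and that value is what gets assigned
rating_b <- tryCatch(get_rating(NA), error = function(e) NA)

print(rating_a)
print(rating_b)
```

In the first pattern rating_a keeps its old value; in the second, rating_b ends up as NA.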
Answer 2
Score: 0
Here is a solution based on the strategy from this answer: https://stackoverflow.com/questions/56673908/how-do-you-scrape-items-together-so-you-dont-lose-the-index/56675147#56675147.
The key is to call html_element() (without the s) on each parent node. html_element() always returns exactly one result per input node; when nothing matches, that result is a missing node, which html_text() turns into NA. So if an element is missing from a parent node, NA fills the gap and the columns stay aligned.
library(rvest)
library(dplyr)
#read the page
url <-"https://www.booking.com/searchresults.it.html?ss=Firenze%2C+Toscana%2C+Italia&efdco=1&label=booking-name-L*Xf2U1sq4*GEkIwcLOALQS267777916051%3Apl%3Ata%3Ap1%3Ap22%2C563%2C000%3Aac%3Aap%3Aneg%3Afi%3Atikwd-65526620%3Alp9069992%3Ali%3Adec%3Adm%3Appccp&aid=376363&lang=it&sb=1&src_elem=sb&src=index&dest_id=-117543&dest_type=city&ac_position=0&ac_click_type=b&ac_langcode=it&ac_suggestion_list_length=5&search_selected=true&search_pageview_id=2e375b14ad810329&ac_meta=GhAyZTM3NWIxNGFkODEwMzI5IAAoATICaXQ6BGZpcmVAAEoAUAA%3D&checkin=2023-06-11&checkout=2023-06-18&group_adults=2&no_rooms=1&group_children=0&sb_travel_purpose=leisure&fbclid=IwAR1BGskP8uicO9nlm5aW7U1A9eABbSwhMNNeQ0gQ-PNoRkHP859L7u0fIsE"
page <- read_html(url)
#parse out the parent node for each property
properties <- html_elements(page, xpath=".//div[@data-testid='property-card']")
#now find the information from each parent
#notice html_element - no "s"
title <- properties %>% html_element("div[data-testid='title']") %>% html_text()
prices <- properties %>% html_element("span[data-testid='price-and-discounted-price']") %>% html_text()
ratings <- properties %>% html_element(xpath=".//div[@aria-label]") %>% html_text()
data.frame(title, prices, ratings)
                        title   prices ratings
1    Sweetly home in Florence US$1.918    <NA>
2                    Pepi Red US$3.062
3  hu Firenze Camping in Town   US$902     8,4
4               Plus Florence US$1.754     7,9
5      Artemente Florence B&B US$4.276
6                 Villa Aruch US$1.658
7                 Hotel Berna US$2.184
8                 Hotel Gioia US$2.437
9               Hotel Magenta US$3.250
10               Villa Neroli US$3.242
11        Residenza Florentia US$2.792     8,0
12 Ridolfi Sei Suite Florence US$1.243    <NA>
...
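The difference between html_elements() and html_element() can be checked offline with a small sketch. The HTML below is made up for illustration (the card, title and score class names are assumptions, not booking.com markup), and it is built with rvest's minimal_html() so nothing is downloaded.

```r
library(rvest)

# Made-up page: three cards, the second one has no score
page <- minimal_html('
  <div class="card"><span class="title">Hotel A</span><span class="score">8,4</span></div>
  <div class="card"><span class="title">Hotel B</span></div>
  <div class="card"><span class="title">Hotel C</span><span class="score">7,9</span></div>
')

# Searching the whole page with html_elements() returns only the matches
# that exist, so the result is shorter than the number of cards
scores_flat <- page %>% html_elements("span.score") %>% html_text()

# Selecting the parent nodes first, then calling html_element() on them,
# yields exactly one value per card, with NA where the score is missing
cards  <- html_elements(page, "div.card")
titles <- cards %>% html_element("span.title") %>% html_text()
scores <- cards %>% html_element("span.score") %>% html_text()

hotels <- data.frame(title = titles, score = scores)
print(hotels)
```

Here scores_flat has two entries for three cards, which is why building a data frame from page-wide vectors fails or misaligns, while the per-card vectors always have matching lengths.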