How to report NA when scraping a web page with R and an element has no value?

Question

I am scraping a page on booking.com, and while creating the data frame I noticed that not all hotels have ratings.

I tried this, for example:

# Got the elements from the Inspect view of the page
titles_page <- page %>% html_elements("div[data-testid='title'][class='fcab3ed991 a23c043802']") %>% html_text()
prices_page <- page %>% html_elements("span[data-testid='price-and-discounted-price']") %>% html_text()
ratings_page <- page %>% html_elements("div[aria-label^='Punteggio di']") %>% html_text()

# The variable ratings
tryCatch(expr = {
      ratings_page <- remDr$findElements(using = "xpath", value = "div[aria-label^='Punteggio di']")$getElementAttribute('value')
    },
    # If the information does not exist, write NA to the ratings element
    error = function(e) {
      ratings_page <- NA
    })

But it does not change anything.

How can I report NA where an element has no value?

The link

Answer 1

Score: 0

Maybe something like the following. Untested.

# The variable ratings
ratings_page <- tryCatch(
  expr = {
    # The selector is CSS, not XPath; findElement() (singular) raises an error when nothing matches
    elem <- remDr$findElement(using = "css selector", value = "div[aria-label^='Punteggio di']")
    elem$getElementAttribute('value')
  },
  # If the lookup fails, ratings_page becomes NA
  error = function(e) NA
)
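
The attempt in the question left ratings_page unchanged because an assignment with `<-` inside the error handler only creates a variable local to that handler function. Assigning the value returned by tryCatch() avoids this. The following is a minimal sketch of that difference; it uses stop() as a stand-in for a failing lookup, and the names rating_a and rating_b are made up for the illustration.

```r
# Case 1: assigning inside the handler does not reach the outer variable.
rating_a <- "unset"
invisible(tryCatch(
  stop("element not found"),          # stand-in for a lookup that fails
  error = function(e) {
    rating_a <- NA                    # local to the handler function only
  }
))
# rating_a is still "unset" here

# Case 2: assign the value that tryCatch() returns.
rating_b <- tryCatch(
  stop("element not found"),
  error = function(e) NA_character_   # the handler's value becomes the result
)
# rating_b is now NA
```

Using `<<-` inside the handler would also change the outer variable, but returning the value from tryCatch() is usually easier to follow.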

Answer 2

Score: 0

Here is a solution based on the strategy from this answer: https://stackoverflow.com/questions/56673908/how-do-you-scrape-items-together-so-you-dont-lose-the-index/56675147#56675147.

The key is to use html_element() (without the s). html_element() always returns exactly one result per input node, and that result is NA when nothing matches. So if an element is missing inside a parent node, NA fills the gap and the columns stay aligned.

library(rvest)
library(dplyr)

# read the page
url <- "https://www.booking.com/searchresults.it.html?ss=Firenze%2C+Toscana%2C+Italia&efdco=1&label=booking-name-L*Xf2U1sq4*GEkIwcLOALQS267777916051%3Apl%3Ata%3Ap1%3Ap22%2C563%2C000%3Aac%3Aap%3Aneg%3Afi%3Atikwd-65526620%3Alp9069992%3Ali%3Adec%3Adm%3Appccp&aid=376363&lang=it&sb=1&src_elem=sb&src=index&dest_id=-117543&dest_type=city&ac_position=0&ac_click_type=b&ac_langcode=it&ac_suggestion_list_length=5&search_selected=true&search_pageview_id=2e375b14ad810329&ac_meta=GhAyZTM3NWIxNGFkODEwMzI5IAAoATICaXQ6BGZpcmVAAEoAUAA%3D&checkin=2023-06-11&checkout=2023-06-18&group_adults=2&no_rooms=1&group_children=0&sb_travel_purpose=leisure&fbclid=IwAR1BGskP8uicO9nlm5aW7U1A9eABbSwhMNNeQ0gQ-PNoRkHP859L7u0fIsE"
page <- read_html(url)

# parse out the parent node for each property
properties <- html_elements(page, xpath=".//div[@data-testid='property-card']")

# now find the information within each parent
# notice html_element - no "s"
title <- properties %>% html_element("div[data-testid='title']") %>% html_text()
prices <- properties %>% html_element("span[data-testid='price-and-discounted-price']") %>% html_text()
ratings <- properties %>% html_element(xpath=".//div[@aria-label]") %>% html_text()

data.frame(title, prices, ratings)

                                       title   prices ratings
1                   Sweetly home in Florence US$1.918    <NA>
2                                   Pepi Red US$3.062
3                 hu Firenze Camping in Town   US$902     8,4
4                              Plus Florence US$1.754     7,9
5                     Artemente Florence B&B US$4.276
6                                Villa Aruch US$1.658
7                                Hotel Berna US$2.184
8                                Hotel Gioia US$2.437
9                              Hotel Magenta US$3.250
10                              Villa Neroli US$3.242
11                       Residenza Florentia US$2.792     8,0
12                Ridolfi Sei Suite Florence US$1.243    <NA>
...
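
The behaviour can be checked offline on a small hand-written page. The sketch below is only an illustration: the HTML, the class names card and score, and the variable names are invented and are not taken from booking.com. It also converts empty strings to NA, since the output above shows blank ratings next to the real `<NA>` values.

```r
library(rvest)

# Hypothetical stand-in for a results page: three cards, one without a
# rating node and one whose rating node is empty.
page <- minimal_html('
  <div class="card"><h3>Hotel A</h3><span class="score">8,4</span></div>
  <div class="card"><h3>Hotel B</h3></div>
  <div class="card"><h3>Hotel C</h3><span class="score"></span></div>
')

# html_elements() over the whole page skips the missing node,
# so it yields fewer values than there are cards.
all_scores <- html_text(html_elements(page, "span.score"))

# html_element() applied to each card keeps one slot per card.
cards  <- html_elements(page, "div.card")
title  <- html_text(html_element(cards, "h3"))
rating <- html_text(html_element(cards, "span.score"))

# Treat empty strings as missing too.
rating[rating %in% ""] <- NA_character_

hotels <- data.frame(title, rating)
```

Here all_scores has two entries for three cards, while title and rating both have three, which is why the per-card approach can be combined into a data frame without misaligning rows.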

Posted by huangapple on April 17, 2023, 03:00:10
When reposting, please keep the link to this article: https://go.coder-hub.com/76029792.html