Scraping pages with inconsistent lengths in dataframe

Question

I want to scrape all the names from [this page](https://www.zorgkaartnederland.nl/huisarts) into a single tibble with three columns. My code only works when every entry has all of its data, which is why I get this error:

    Error: Tibble columns must have consistent lengths, only values of length one are recycled:
    * Length 20: Columns `huisarts`, `url`
    * Length 21: Column `praktijk`

How can I make my code run and fill the tibble with `NA` wherever the data is missing?

My code for a pausing robot, used later in the scraper function:

    pauzing_robot <- function (periods = c(0, 1)) {
      tictoc <- runif(1, periods[1], periods[2])
      cat(paste0(Sys.time()), 
          "- Sleeping for ", round(tictoc, 2), "seconds\n")
      Sys.sleep(tictoc)
    }

Scraper:

    library(tidyverse)
    library(rvest)

    scrape_page <- function(pagina_nummer) {
      
      page <- read_html(paste0("https://www.zorgkaartnederland.nl/huisarts/pagina", pagina_nummer)) 
      
      pauzing_robot(periods = c(0, 1.5))
      
      tibble(
        
        huisarts = page %>% 
          html_nodes(".media-heading.title.orange") %>% 
          html_text() %>% 
          str_trim(), 
        
        praktijk = page %>% 
          html_nodes(".location") %>% 
          html_text() %>%
          str_trim(),
        
        url = page %>% 
          html_nodes(".media-heading.title.orange") %>% 
          html_nodes("a") %>%
          html_attr("href") %>% 
          str_trim() %>% 
          paste0("https://www.zorgkaartnederland.nl", .)
      )
    }

There are 445 pages in total, but for the sake of the example I only scrape three:

    huisartsen <- map_df(sample(1:3), scrape_page)

Page 2 seems to be the problem, since its columns have inconsistent lengths; this code works:

    huisartsen <- map_df(3:4, scrape_page)

If possible, please use tidyverse code. Thanks in advance.

Answer 1

Score: 3

First, retrieve the list of parent nodes (one `li.media` element per entry):

    parents <- page %>% html_nodes("li.media")

Then parse each parent node with `html_node()`:

    tibble(
      huisarts = parents %>% 
        html_node(".media-heading.title.orange") %>% 
        html_text() %>% 
        str_trim(), 

      praktijk = parents %>% 
        html_node(".location") %>% 
        html_text() %>%
        str_trim(),

      url = parents %>% 
        html_node(".media-heading.title.orange a") %>% 
        html_attr("href") %>% 
        str_trim() %>% 
        paste0("https://www.zorgkaartnederland.nl", .)
    )
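As a small illustration of why going through the parent nodes keeps the columns aligned, the sketch below runs both approaches on a tiny inline HTML document. The markup, names, and places are invented for this example and only mimic the `li.media` / `.location` structure; the real page will differ.

```r
library(rvest)

# Hypothetical markup: the second entry has no .location element.
page <- read_html('
  <ul>
    <li class="media"><h4 class="title"><a href="/a">Dr. A</a></h4><p class="location">Amsterdam</p></li>
    <li class="media"><h4 class="title"><a href="/b">Dr. B</a></h4></li>
    <li class="media"><h4 class="title"><a href="/c">Dr. C</a></h4><p class="location">Utrecht</p></li>
  </ul>
')

# html_nodes() on the whole page simply skips the entry without a location,
# so the two vectors end up with different lengths.
flat_names <- page %>% html_nodes(".title") %>% html_text()
flat_locations <- page %>% html_nodes(".location") %>% html_text()
length(flat_names)      # 3
length(flat_locations)  # 2

# html_node() applied to each parent yields exactly one result per parent,
# with NA where nothing matched.
parents <- page %>% html_nodes("li.media")
names_by_parent <- parents %>% html_node(".title") %>% html_text()
locations_by_parent <- parents %>% html_node(".location") %>% html_text()
locations_by_parent     # "Amsterdam" NA "Utrecht"
```

Because both per-parent vectors have one element per `li.media`, they can be combined into a tibble without a length error.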

`html_node()` always returns exactly one result per parent node, even if that result is just `NA`, so all three columns end up with the same length.
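One caveat applies to the `url` column if an entry ever lacks a link: `paste0()` converts `NA` to the text "NA" and glues it onto the base URL. A possible alternative, shown here as a sketch rather than as part of the original answer, is `stringr::str_c()`, which propagates missing values. The `hrefs` values below are hypothetical.

```r
library(stringr)

base_url <- "https://www.zorgkaartnederland.nl"

# Hypothetical hrefs: the second entry stands for a parent without a link.
hrefs <- c("/zorgverlener/huisarts-a", NA)

# paste0() treats NA as the string "NA", so the missing value is lost.
with_paste <- paste0(base_url, hrefs)
with_paste[2]   # "https://www.zorgkaartnederland.nlNA"

# str_c() keeps NA as NA, so the missing link stays visible in the tibble.
with_str_c <- str_c(base_url, hrefs)
is.na(with_str_c[2])   # TRUE
```

In the pipeline above, this would mean replacing the final `paste0(...)` step with `str_c("https://www.zorgkaartnederland.nl", .)`.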

huangapple
  • Posted on 2020-01-06 23:42:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59614978.html