January 6, 2020, 23:42:35

Scraping pages with inconsistent lengths in dataframe

Question

I want to scrape all the names from [this page](https://www.zorgkaartnederland.nl/huisarts) into a single tibble with three columns. My code only works if all the data is there, hence this error:

    Error: Tibble columns must have consistent lengths, only values of length one are recycled:
    * Length 20: Columns `huisarts`, `url`
    * Length 21: Column `praktijk`

How can I let my code run and fill the tibble with `NA` where the data isn't there?

My code for a pausing robot, used later in the scraper function:

    pauzing_robot <- function (periods = c(0, 1)) {
      # Draw a random pause length (in seconds) between periods[1] and periods[2]
      tictoc <- runif(1, periods[1], periods[2])
      cat(paste0(Sys.time()),
          "- Sleeping for ", round(tictoc, 2), "seconds\n")
      Sys.sleep(tictoc)
    }

Scraper:

    library(tidyverse)
    library(rvest)
    scrape_page <- function(pagina_nummer) {

      page <- read_html(paste0("https://www.zorgkaartnederland.nl/huisarts/pagina", pagina_nummer))

      pauzing_robot(periods = c(0, 1.5))

      # Each column is selected from the whole page independently
      tibble(

        huisarts = page %>%
          html_nodes(".media-heading.title.orange") %>%
          html_text() %>%
          str_trim(),

        praktijk = page %>%
          html_nodes(".location") %>%
          html_text() %>%
          str_trim(),

        url = page %>%
          html_nodes(".media-heading.title.orange") %>%
          html_nodes("a") %>%
          html_attr("href") %>%
          str_trim() %>%
          paste0("https://www.zorgkaartnederland.nl", .)
      )
    }

There are 445 pages in total, but for the sake of example I only scrape three:

    huisartsen <- map_df(sample(1:3), scrape_page)

Page 2 seems to be the problem with inconsistent lengths, because this code works:

    huisartsen <- map_df(3:4, scrape_page)

If possible, with tidyverse code. Thanks in advance.
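
The length mismatch can probably be reproduced without touching the website. The sketch below is an illustration only: the inline markup is hypothetical (it merely reuses the class names from the selectors above), and it assumes that one listing on the real page lacks one of the fields. Selecting each field from the whole page with `html_nodes()` then yields vectors of different lengths, which `tibble()` refuses to combine.

```r
library(rvest)
library(tibble)

# Hypothetical markup: three listings, the second one has no name heading.
html <- '
<ul>
  <li class="media"><h4 class="media-heading title orange"><a href="/a">Dr. A</a></h4><p class="location">Practice A</p></li>
  <li class="media"><p class="location">Practice B</p></li>
  <li class="media"><h4 class="media-heading title orange"><a href="/c">Dr. C</a></h4><p class="location">Practice C</p></li>
</ul>'
page <- read_html(html)

# Page-wide selection silently skips the listing that lacks the element.
names_found <- page %>% html_nodes(".media-heading.title.orange") %>% html_text()
locations_found <- page %>% html_nodes(".location") %>% html_text()

length(names_found)      # 2
length(locations_found)  # 3

# tibble() only recycles length-one vectors, so this combination fails.
result <- tryCatch(
  tibble(huisarts = names_found, praktijk = locations_found),
  error = function(e) "inconsistent lengths"
)
```

Because the vectors are built independently, nothing ties a name to the location of the same listing, so even equal lengths would not guarantee that rows line up.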

Answer 1

Score: 3

You need to retrieve the list of parent nodes first:

    parents <- page %>% html_nodes("li.media")

Then extract each field from the parent nodes with `html_node()`:

    tibble(
      huisarts = parents %>%
        html_node(".media-heading.title.orange") %>%
        html_text() %>%
        str_trim(),
      praktijk = parents %>%
        html_node(".location") %>%
        html_text() %>%
        str_trim(),
      url = parents %>%
        html_node(".media-heading.title.orange a") %>%
        html_attr("href") %>%
        str_trim() %>%
        paste0("https://www.zorgkaartnederland.nl", .)
    )

`html_node()` always returns exactly one result per parent node, even if that result is just `NA`, so all three columns have the same length.
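
As a rough offline check of this behaviour, the sketch below runs the same parent-node approach on a small inline document. The markup is hypothetical (it only reuses the class names from the selectors above), and `base_url` and `href` are illustrative names. One detail worth noting: `paste0()` turns a missing `href` into a string ending in `NA`, so the sketch keeps the URL as a real `NA` with `if_else()`; that guard is an addition here, not part of the code above.

```r
library(rvest)
library(tibble)
library(stringr)
library(dplyr)

base_url <- "https://www.zorgkaartnederland.nl"

# Hypothetical markup: the second listing has a location but no name or link.
html <- '
<ul>
  <li class="media"><h4 class="media-heading title orange"><a href="/a">Dr. A</a></h4><p class="location">Practice A</p></li>
  <li class="media"><p class="location">Practice B</p></li>
  <li class="media"><h4 class="media-heading title orange"><a href="/c">Dr. C</a></h4><p class="location">Practice C</p></li>
</ul>'

# One parent node per listing
parents <- read_html(html) %>% html_nodes("li.media")

# html_node() gives one result per parent; a missing link becomes NA
href <- parents %>%
  html_node(".media-heading.title.orange a") %>%
  html_attr("href") %>%
  str_trim()

listing <- tibble(
  huisarts = parents %>%
    html_node(".media-heading.title.orange") %>%
    html_text() %>%
    str_trim(),
  praktijk = parents %>%
    html_node(".location") %>%
    html_text() %>%
    str_trim(),
  # Only prepend the base URL where a link was actually found
  url = if_else(is.na(href), NA_character_, paste0(base_url, href))
)

listing
```

Every column now has one entry per `li.media` node, so the rows stay aligned and the incomplete listing simply shows `NA` for `huisarts` and `url`.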
