Scraping pages with inconsistent lengths in dataframe
Question
I want to scrape all the names from [this page](https://www.zorgkaartnederland.nl/huisarts), with the result being one tibble of three columns. My code only works if all the data is there, hence my error:
    Error: Tibble columns must have consistent lengths, only values of length one are recycled:
    * Length 20: Columns `huisarts`, `url`
    * Length 21: Column `praktijk`
How can I let my code run, but fill the `tibble` with `NA`s if the data isn't there?
My code for a pausing robot, used later in the scraper function:
    pauzing_robot <- function(periods = c(0, 1)) {
      tictoc <- runif(1, periods[1], periods[2])
      cat(paste0(Sys.time()),
          "- Sleeping for ", round(tictoc, 2), "seconds\n")
      Sys.sleep(tictoc)
    }
Scraper:
    library(tidyverse)
    library(rvest)

    scrape_page <- function(pagina_nummer) {
      page <- read_html(paste0("https://www.zorgkaartnederland.nl/huisarts/pagina", pagina_nummer))
      pauzing_robot(periods = c(0, 1.5))
      tibble(
        huisarts = page %>%
          html_nodes(".media-heading.title.orange") %>%
          html_text() %>%
          str_trim(),
        praktijk = page %>%
          html_nodes(".location") %>%
          html_text() %>%
          str_trim(),
        url = page %>%
          html_nodes(".media-heading.title.orange") %>%
          html_nodes("a") %>%
          html_attr("href") %>%
          str_trim() %>%
          paste0("https://www.zorgkaartnederland.nl", .)
      )
    }
There are 445 pages in total, but for the sake of the example I only scrape three:

    huisartsen <- map_df(sample(1:3), scrape_page)

Page 2 seems to be the one with inconsistent lengths, because this code works:

    huisartsen <- map_df(3:4, scrape_page)

If possible, with `tidyverse` code. Thanks in advance.
Answer 1
Score: 3
You need to retrieve the list of parent nodes first:

    parents <- page %>% html_nodes("li.media")

Then parse each parent node with `html_node()`:
    tibble(
      huisarts = parents %>%
        html_node(".media-heading.title.orange") %>%
        html_text() %>%
        str_trim(),
      praktijk = parents %>%
        html_node(".location") %>%
        html_text() %>%
        str_trim(),
      url = parents %>%
        html_node(".media-heading.title.orange a") %>%
        html_attr("href") %>%
        str_trim() %>%
        paste0("https://www.zorgkaartnederland.nl", .)
    )
Applied to a list of nodes, `html_node()` always returns exactly one result per parent, even if it is just an `NA`, so all three columns end up with the same length.
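To see the difference without hitting the site, here is a minimal offline sketch. The HTML snippet, the names in it, and the helper `parse_listing()` are made up for illustration; only the CSS selectors are taken from the question, and the real page markup may differ. The second entry deliberately has no `.location` element and no link.

```r
library(rvest)    # read_html(), html_nodes(), html_node(), html_text(), html_attr()
library(tibble)   # tibble()
library(stringr)  # str_trim(), str_c()

# Hypothetical listing: the second entry lacks both ".location" and the <a> link.
page <- read_html('<ul>
  <li class="media">
    <h4 class="media-heading title orange"><a href="/zorgverlener/a-1"> Dr. A </a></h4>
    <div class="location"> Praktijk A </div>
  </li>
  <li class="media">
    <h4 class="media-heading title orange"> Dr. B </h4>
  </li>
</ul>')

# Page-wide search: only elements that exist are returned, so the counts differ.
n_names     <- length(html_nodes(page, ".media-heading.title.orange"))
n_locations <- length(html_nodes(page, ".location"))

# Per-parent search: one result per <li>, NA where nothing matches.
parse_listing <- function(page) {
  parents <- html_nodes(page, "li.media")
  href <- parents %>%
    html_node(".media-heading.title.orange a") %>%
    html_attr("href") %>%
    str_trim()
  tibble(
    huisarts = parents %>%
      html_node(".media-heading.title.orange") %>%
      html_text() %>%
      str_trim(),
    praktijk = parents %>%
      html_node(".location") %>%
      html_text() %>%
      str_trim(),
    # str_c() propagates NA, so a missing link stays NA in the result.
    url = str_c("https://www.zorgkaartnederland.nl", href)
  )
}

result <- parse_listing(page)
print(result)
```

Here the page-wide counts disagree (two headings, one location), while `parse_listing()` still returns two rows with `NA` in the gaps. The sketch uses `str_c()` for the URL because `paste0()` would turn a missing `href` into a string ending in the literal text `NA`; whether any entry on the real site lacks a link is an assumption. In rvest 1.0.0 and later, `html_node()` and `html_nodes()` are also available under the names `html_element()` and `html_elements()`.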