Web scraping with a loop
Question
I'm trying to download zip codes that are spread across several pages. I started with a list of links, one for each municipality in Mexico City:
library(httr)  # GET()
library(XML)   # htmlParse(), xpathSApply(), xmlGetAttr()
url <- "https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/"
resource <- GET(url)
parse <- htmlParse(resource)
links <- as.character(xpathSApply(parse, path = "//a", xmlGetAttr, "href"))
print(links)
Then I'm trying to write a function, to be called in a loop, that reads each URL and extracts its table of zip codes, so that I can later combine the per-municipality tables into one large data set:
scraper <- function(url) {
  html <- read_html(url)
  tabla <- html %>%
    html_elements("td , th") %>%
    html_text2()
  data <- matrix(ncol = 3, nrow = length(tabla))
  data <- data.frame(matrix(tabla, nrow = length(tabla), ncol = 3, byrow = TRUE)) %>%
    row_to_names(row_number = 1)
}
The table has three columns ("municipality", "locality" and "zp"), which is why I set the number of columns to 3. However, I get this error:
Error: x must be a string of length 1
I also cannot combine all the matrices.
Any ideas are greatly appreciated!
Answer 1
Score: 1
Here is a way to scrape the zip codes for Mexico City.
suppressPackageStartupMessages({
  library(rvest)
  library(magrittr)
})
scraper <- function(link) {
  link %>%
    read_html() %>%
    html_table() %>%
    `[[`(1)
}
url <- "https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/"
page <- read_html(url)
zip_codes_list <- page %>%
  html_elements("a") %>%
  html_attr("href") %>%
  grep("mexico/Ciudad-de-Mexico/.+", ., value = TRUE) %>%
  lapply(scraper)
Then rbind them all together.
zip_codes <- do.call(rbind, zip_codes_list)
Edit
In the original post I loaded the dplyr package. On second thought, I realized it was loaded only to make the magrittr pipe operator available, so I changed the code to load just the relevant package, magrittr.
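The lapply-then-rbind pattern in this answer can be tried without network access. The sketch below is only an illustration under stated assumptions: the two in-memory pages built with rvest's minimal_html() are hypothetical stand-ins for the municipality pages, and the helper name first_table is not from the original post. It also suggests one possible reading of the error in the question: the scraping function handles one page per call, so a vector of links has to be iterated over rather than passed in whole.

```r
library(rvest)

# Hypothetical in-memory pages standing in for two municipality URLs.
pages <- list(
  minimal_html("<table>
    <tr><th>Municipio</th><th>Localidad</th><th>Codigo Postal</th></tr>
    <tr><td>Alvaro Obregon</td><td>Abeto</td><td>01440</td></tr>
    <tr><td>Alvaro Obregon</td><td>8 de Agosto</td><td>01180</td></tr>
  </table>"),
  minimal_html("<table>
    <tr><th>Municipio</th><th>Localidad</th><th>Codigo Postal</th></tr>
    <tr><td>Alvaro Obregon</td><td>1a Victoria</td><td>01160</td></tr>
  </table>")
)

# Handles ONE parsed page per call and returns its first table.
# html_table() on a whole document returns a list of tables.
first_table <- function(page) {
  html_table(page)[[1]]
}

# Apply the function to each page, then stack the results row-wise.
tables <- lapply(pages, first_table)
zip_codes <- do.call(rbind, tables)
print(zip_codes)
```

Because html_table() already returns one row per table row with the header used as column names, no manual matrix reshaping of the "td" and "th" cells is needed.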
Answer 2
Score: 1
library(tidyverse)
library(rvest)
"https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/" %>%
  read_html() %>%
  html_elements(".ctrLink a") %>%
  html_attr("href") %>% # Grab all links from the main URL
  map_dfr(~ .x %>% # Map through municipalities, scrape tables and row bind
    read_html() %>%
    html_table())
# Consider janitor::clean_names() at the end
# A tibble: 2,014 × 3
#    Municipio      Localidad                           `Código Postal`
#    <chr>          <chr>                                         <int>
#  1 Alvaro Obregon 1a Ampliación Presidentes                      1299
#  2 Alvaro Obregon 1a Sección Cañada                              1269
#  3 Alvaro Obregon 1a Victoria                                    1160
#  4 Alvaro Obregon 2a Ampliación Presidentes                      1299
#  5 Alvaro Obregon 2a Del Moral del Pueblo de Tetelpan            1700
#  6 Alvaro Obregon 2a Sección Cañada                              1269
#  7 Alvaro Obregon 2o Reacomodo Tlacuitlapa                       1650
#  8 Alvaro Obregon 8 de Agosto                                    1180
#  9 Alvaro Obregon Abeto                                          1440
# 10 Alvaro Obregon Abraham M. González                            1170
# … with 2,004 more rows
# ℹ Use `print(n = ...)` to see more rows
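One detail in the output above may be worth checking: the postal code column is parsed as an integer, so any code that starts with a zero loses it. Assuming the site lists five-digit codes such as 01440 (an assumption, not something confirmed in the post), the sketch below shows two ways to keep them intact. The in-memory page is a hypothetical stand-in for a real municipality page, and the ASCII column name "Codigo Postal" is used only to keep the example portable.

```r
library(rvest)

# Hypothetical in-memory page standing in for one municipality URL.
page <- minimal_html("<table>
  <tr><th>Municipio</th><th>Localidad</th><th>Codigo Postal</th></tr>
  <tr><td>Alvaro Obregon</td><td>Abeto</td><td>01440</td></tr>
  <tr><td>Alvaro Obregon</td><td>8 de Agosto</td><td>01180</td></tr>
</table>")

# Default behaviour: columns are type-converted, so the codes become integers.
converted <- html_table(page)[[1]]

# Option 1: convert = FALSE keeps every cell as text, leading zeros included.
as_text <- html_table(page, convert = FALSE)[[1]]

# Option 2: re-pad an already-converted integer column to five digits.
repadded <- sprintf("%05d", converted[["Codigo Postal"]])

print(as_text)
print(repadded)
```

Reading the codes as text is usually the safer choice, since postal codes are identifiers rather than quantities.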