Web scraping with a loop

Question

I'm trying to download zip codes that are spread across different pages. I started with the list of links for each municipality inside Mexico City.

url <- "https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/"
resource <- GET(url)
parse <- htmlParse(resource)
links <- as.character(xpathSApply(parse, path = "//a", xmlGetAttr, "href"))
print(links)

Then I'm trying to write a loop that takes each URL and grabs its table of zip codes, so that I can later combine the matrices created for each municipality into one big data set:

scraper <- function(url){
  html <- read_html(url)
  tabla <- html %>%
    html_elements("td , th") %>%
    html_text2()
  data <- matrix(ncol = 3, nrow = length(tabla))
  data <- data.frame(matrix(tabla, nrow = length(tabla), ncol = 3, byrow = TRUE)) %>%
    row_to_names(row_number = 1)
}

The result should have three columns ("municipality", "locality", "zp"), which is why ncol is 3, but I get "Error: x must be a string of length 1", and I also cannot combine all the matrices.
Any ideas are greatly appreciated!

Answer 1

Score: 1

Here is a way to scrape the zip codes of Ciudad-de-Mexico.

suppressPackageStartupMessages({
  library(rvest)
  library(magrittr)
})

scraper <- function(link) {
  link %>%
    read_html() %>%
    html_table() %>%
    `[[`(1)
}

url <- "https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/"

page <- read_html(url)
zip_codes_list <- page %>%
  html_elements("a") %>%
  html_attr("href") %>%
  grep("mexico/Ciudad-de-Mexico/.+", ., value = TRUE) %>%
  lapply(scraper)
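
The error in the question, "x must be a string of length 1", is presumably raised because read_html() accepts one URL at a time and was handed the whole vector of links; the lapply() call above avoids this by calling scraper once per link. The pattern can be sketched without network access as follows, where fetch_one and the example.com links are hypothetical stand-ins and not part of the original code:

```r
# Hypothetical stand-in for a scraper that, like read_html(),
# insists on a single string rather than a vector of URLs.
fetch_one <- function(link) {
  if (!is.character(link) || length(link) != 1) {
    stop("x must be a string of length 1")
  }
  # Return a one-row data frame in place of a scraped table.
  data.frame(link = link, n_chars = nchar(link))
}

# Made-up links, for illustration only.
links <- c("https://example.com/a", "https://example.com/bb")

# Passing the whole vector at once fails ...
failed <- inherits(try(fetch_one(links), silent = TRUE), "try-error")

# ... while lapply() calls fetch_one() once per element
# and collects the results in a list.
results <- lapply(links, fetch_one)
```

Here results is a list with one data frame per link, which is the same shape as zip_codes_list.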

Then rbind them all together.

zip_codes <- do.call(rbind, zip_codes_list)
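
do.call(rbind, ...) passes every element of the list to a single rbind() call, which works as long as all tables share the same column names. A minimal sketch with invented rows (the column names and values below are illustrative, not scraped data):

```r
# Toy list standing in for zip_codes_list; all values are invented.
tables <- list(
  data.frame(municipio = "A", localidad = c("x", "y"), cp = c(1000L, 1010L)),
  data.frame(municipio = "B", localidad = "z", cp = 2000L)
)

# Equivalent to rbind(tables[[1]], tables[[2]]).
combined <- do.call(rbind, tables)
dim(combined)
```

The combined data frame has one row per locality across all municipalities.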

Edit

In the original post I loaded the dplyr package. On second thought, I realized it was loaded only to make the magrittr pipe operator available, so I changed the code to load only the relevant package, magrittr.

Answer 2

Score: 1

library(tidyverse)
library(rvest)

"https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/" %>%
  read_html() %>%
  html_elements(".ctrLink a") %>%
  html_attr("href") %>% # Grab all links from the main URL
  map_dfr(~ .x %>% # Map through municipalities, scrape tables and row bind
    read_html() %>%
    html_table())

# Consider janitor::clean_names() at the end

# A tibble: 2,014 × 3
   Municipio      Localidad                           `Código Postal`
   <chr>          <chr>                                         <int>
 1 Alvaro Obregon 1a Ampliación Presidentes                      1299
 2 Alvaro Obregon 1a Sección Cañada                              1269
 3 Alvaro Obregon 1a Victoria                                    1160
 4 Alvaro Obregon 2a Ampliación Presidentes                      1299
 5 Alvaro Obregon 2a Del Moral del Pueblo de Tetelpan            1700
 6 Alvaro Obregon 2a Sección Cañada                              1269
 7 Alvaro Obregon 2o Reacomodo Tlacuitlapa                       1650
 8 Alvaro Obregon 8 de Agosto                                    1180
 9 Alvaro Obregon Abeto                                          1440
10 Alvaro Obregon Abraham M. González                            1170
# … with 2,004 more rows
# ℹ Use `print(n = ...)` to see more rows
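
Note that the postal code column is parsed as an integer, so the leading zero of the five-digit codes is dropped (1299 rather than 01299). If the codes are needed as text, one option is to pad them back after scraping. A small sketch, assuming five-digit codes; the vector below simply reuses values from the printed output:

```r
# Integer codes as they appear in the tibble above.
codes <- c(1299L, 1269L, 1160L)

# Left-pad with zeros to a width of five characters.
padded <- sprintf("%05d", codes)
padded
```

This returns character strings, so the column should be treated as text from then on.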

huangapple
  • Posted on February 27, 2023, 10:14:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75576268.html