2023-02-27 10:14:03

Web scraping with a loop

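Question

I'm trying to download zip codes that are spread across different pages. I started with a list of nodes for each municipality inside Mexico City.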

library(httr)  # GET()
library(XML)   # htmlParse(), xpathSApply(), xmlGetAttr()

url <- "https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/"
resource <- GET(url)
parse <- htmlParse(resource)
links <- as.character(xpathSApply(parse, path = "//a", xmlGetAttr, "href"))
print(links)

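Then I'm trying to create a loop that visits each URL and grabs its table of zip codes, so that I can later combine the matrices created for each municipality into one large data set: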

library(rvest)    # read_html(), html_elements(), html_text2(), %>%
library(janitor)  # row_to_names()

scraper <- function(url){
  html <- read_html(url)
  tabla <- html %>%
    html_elements("td , th") %>%
    html_text2()
  data <- matrix(ncol = 3, nrow = length(tabla))
  data <- data.frame(matrix(tabla, nrow = length(tabla), ncol = 3, byrow = TRUE)) %>%
    row_to_names(row_number = 1)
}

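I will have "municipality", "locality", and "zp" columns, which is why the number of columns is 3, but I get "Error: x must be a string of length 1", and I also cannot combine all the matrices. Any ideas are greatly appreciated!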

Answer 1

Score: 1

Here is a way to scrape the zip codes of Mexico City.

suppressPackageStartupMessages({
  library(rvest)
  library(magrittr)
})

scraper <- function(link) {
  link %>%
    read_html() %>%
    html_table() %>%
    `[[`(1)
}

url <- "https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/"

page <- read_html(url)
zip_codes_list <- page %>%
  html_elements("a") %>%
  html_attr("href") %>%
  grep("mexico/Ciudad-de-Mexico/.+", ., value = TRUE) %>%
  lapply(scraper)
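
The link filter in the pipeline above can be checked offline. This is a minimal sketch, assuming a hypothetical `hrefs` vector that stands in for the scraped `href` attributes (the real links on the page may look different). The `.+` at the end of the pattern requires at least one character after the city path, so the index page itself is dropped. Note also that `lapply` hands `scraper` one URL at a time; the error quoted in the question is consistent with passing a whole vector of URLs to `read_html()` at once.

```r
# Hypothetical hrefs standing in for what html_attr("href") might return
hrefs <- c(
  "https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/",
  "https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/Alvaro-Obregon/",
  "https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/Coyoacan/",
  "https://www.codigopostal.lat/mexico/Jalisco/"
)

# Keep only links that continue past the city path (the municipality pages)
municipality_links <- grep("mexico/Ciudad-de-Mexico/.+", hrefs, value = TRUE)
print(municipality_links)
```

Only the two municipality-style links survive the filter in this toy input.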

Then rbind them all together.

zip_codes <- do.call(rbind, zip_codes_list)
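
To see what the `do.call(rbind, ...)` step does without touching the network, here is a small sketch with made-up tables. The rows and the `CP` column name are illustrative assumptions, not data from the site. `rbind` on data frames expects every element to have the same column names, which should hold if each municipality page uses the same table layout.

```r
# Toy per-municipality tables (made-up rows, same three columns)
tables <- list(
  data.frame(Municipio = "A", Localidad = c("x", "y"), CP = c(1000L, 1010L)),
  data.frame(Municipio = "B", Localidad = "z", CP = 2000L)
)

# Equivalent to rbind(tables[[1]], tables[[2]]), but works for a list of any length
combined <- do.call(rbind, tables)
print(dim(combined))
```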

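Edit

In the original post I loaded the dplyr package. On second thought, I realized it was loaded only to make the magrittr pipe operator available, so I changed the code to load only the relevant package, magrittr.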

Answer 2

Score: 1

library(tidyverse)
library(rvest)

"https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/" %>%
  read_html() %>%
  html_elements(".ctrLink a") %>%
  html_attr("href") %>% # Grab all links from the main URL
  map_dfr(~ .x %>% # Map through municipalities, scrape tables and row bind
    read_html() %>%
    html_table())

# Consider janitor::clean_names() at the end

# A tibble: 2,014 × 3
   Municipio      Localidad                           `Código Postal`
   <chr>          <chr>                                         <int>
 1 Alvaro Obregon 1a Ampliación Presidentes                      1299
 2 Alvaro Obregon 1a Sección Cañada                              1269
 3 Alvaro Obregon 1a Victoria                                    1160
 4 Alvaro Obregon 2a Ampliación Presidentes                      1299
 5 Alvaro Obregon 2a Del Moral del Pueblo de Tetelpan            1700
 6 Alvaro Obregon 2a Sección Cañada                              1269
 7 Alvaro Obregon 2o Reacomodo Tlacuitlapa                       1650
 8 Alvaro Obregon 8 de Agosto                                    1180
 9 Alvaro Obregon Abeto                                          1440
10 Alvaro Obregon Abraham M. González                            1170
# … with 2,004 more rows
# ℹ Use `print(n = ...)` to see more rows
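
One thing the printed output above hints at: the postal code column is typed `<int>`, so any leading zero is lost when the table is parsed. If the codes on the site are meant to be five digits long (an assumption worth checking against the page itself), they could be restored as zero-padded strings. A minimal sketch using values copied from the output:

```r
# Toy values copied from the printed output above
zip_int <- c(1299L, 1269L, 1160L)

# Zero-pad to five digits; assumes every code is meant to be 5 digits long
zip_chr <- sprintf("%05d", zip_int)
print(zip_chr)
```

Storing the result as character rather than integer keeps the padding when the data is written out or joined with other sources.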
