How to download a PDF file from the web with R (encoding issue)
Question
I am trying to download a PDF file from a website using R. When I tried to use the function browseURL, it only worked with the argument encodeIfNeeded = TRUE. Meanwhile, if I pass the same URL to the function download.file, it returns "cannot open destfile 'downloaded/teste.pdf', reason 'No such file or directory'", i.e., it can't resolve the correct URL.
How do I fix the encoding so that I can download the file programmatically?
I need to automate this, because there are more than a thousand files to download.
Here's a minimal reproducible example:
library(tidyverse)
library(rvest)
url <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html"
webpage <- read_html(url)
# scraping hyperlinks
links_decisoes <- html_nodes(webpage, ".borderTD a") %>%
  html_attr("href")

# creating the full/correct urls
full_links <- paste("http://www.ouvidoriageral.sp.gov.br/", links_decisoes, sep = "")

# browseURL only works with encodeIfNeeded = TRUE
browseURL(full_links[1], encodeIfNeeded = TRUE,
          browser = "C://Program Files//Mozilla Firefox//firefox.exe")

# returns an error
download.file(full_links[1], "downloaded/teste.pdf")
Answer 1
Score: 3
There are a couple of problems here. Firstly, the links to some of the files are not properly formatted as URLs: they contain spaces and other special characters. To convert them you can use url_escape(), which should already be available to you, since loading rvest also loads xml2, which provides url_escape().
Secondly, the path you are saving to is relative to your R home directory, but you are not telling R this. You either need the full path, like this: "C://Users/Manoel/Documents/downloaded/testes.pdf", or a relative path, like this: path.expand("~/downloaded/testes.pdf").
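To see what these two helpers actually do, here is a minimal illustration (the filename is invented for the example):

```r
library(xml2)

# spaces are not legal in a URL; url_escape() percent-encodes them
url_escape("Decisao LAI 123.pdf")
#> [1] "Decisao%20LAI%20123.pdf"

# "~" is expanded to your R home directory, making the path unambiguous
path.expand("~/downloaded/teste.pdf")
```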
This code should do what you need:
library(tidyverse)
library(rvest)
# scraping hyperlinks
full_links <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html" %>%
  read_html() %>%
  html_nodes(".borderTD a") %>%
  html_attr("href") %>%
  url_escape() %>%
  {paste0("http://www.ouvidoriageral.sp.gov.br/", .)}

# looks at the page in Firefox
browseURL(full_links[1], encodeIfNeeded = TRUE, browser = "firefox.exe")

# saves the pdf to the "downloaded" folder, if it exists
download.file(full_links[1], path.expand("~/downloaded/teste.pdf"))
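Since there are more than a thousand files, the single download.file() call above can be wrapped in a loop. A sketch, not a definitive implementation: the destination folder, the file-naming scheme, and the one-second pause are all assumptions, and mode = "wb" is added to avoid corrupting binary PDFs on Windows:

```r
# create the destination folder if it does not exist yet
dir.create(path.expand("~/downloaded"), showWarnings = FALSE)

for (i in seq_along(full_links)) {
  dest <- path.expand(sprintf("~/downloaded/decisao_%04d.pdf", i))
  # skip files that are already present, so the loop can be re-run safely
  if (!file.exists(dest)) {
    tryCatch(
      download.file(full_links[i], dest, mode = "wb"),
      error = function(e) message("failed: ", full_links[i])
    )
    Sys.sleep(1)  # small pause to avoid hammering the server
  }
}
```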