How to download a PDF file from the web with R (encoding issue)

Question

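I am trying to download a PDF file from a website using R. When I tried the browseURL function, it only worked with the argument encodeIfNeeded = T. However, when I pass the same URL to download.file, it returns the error "cannot open destfile 'downloaded/teste.pdf', reason 'No such file or directory'", which I take to mean that it can't find the correct URL.

How do I fix the encoding so that I can download the file programmatically?
I need to automate this, because there are more than a thousand files to download.

Here is a minimal reproducible example: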

library(tidyverse)
library(rvest)

url <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html"
webpage <- read_html(url)

# scraping hyperlinks
links_decisoes <- html_nodes(webpage, ".borderTD a") %>%
  html_attr("href")

# creating full/correct url
full_links <- paste("http://www.ouvidoriageral.sp.gov.br/", links_decisoes, sep="")

# browseURL only works with encodeIfNeeded = T
browseURL(full_links[1], encodeIfNeeded = T,
          browser = "C://Program Files//Mozilla Firefox//firefox.exe")

# returns an error
download.file(full_links[1], "downloaded/teste.pdf")

Answer 1

Score: 3

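There are a couple of problems here. Firstly, the links to some of the files are not properly formatted as URLs: they contain spaces and other special characters. To convert them you must use url_escape(), which should be available to you because loading rvest also loads xml2, which contains url_escape(). If R cannot find the function, calling it as xml2::url_escape() works without attaching xml2.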
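As a small illustration of what the escaping step does, the sketch below runs a made-up href through base R's utils::URLencode() and, if xml2 is installed, through xml2::url_escape(). The file name is hypothetical and not taken from the site. As far as I can tell, url_escape() also escapes "/" unless it is listed in its reserved argument, which would matter if an href contains sub-folders.

```r
# Hypothetical href with a space and a sub-folder (an assumption for
# illustration, not a link scraped from the real page).
href <- "arquivos/decisao 01.pdf"

# Base R: URLencode() percent-encodes the space but leaves reserved
# characters such as "/" untouched by default.
encoded_base <- utils::URLencode(href)
print(encoded_base)

# xml2, only if it is installed: by default url_escape() escapes "/" too;
# passing reserved = "/" keeps it as a path separator.
if (requireNamespace("xml2", quietly = TRUE)) {
  print(xml2::url_escape(href))
  print(xml2::url_escape(href, reserved = "/"))
}
```

Either function can be used in the scraping pipeline; which one fits depends on whether the hrefs on the page contain path separators.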

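Secondly, the path you are saving to is a relative path, which R resolves against its current working directory, and the error message indicates that there is no "downloaded" folder there. You need either the full path, like "C://Users/Manoel/Documents/downloaded/teste.pdf", or a path anchored at your home directory, like path.expand("~/downloaded/teste.pdf").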
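The destination-folder point can be sketched as follows. Since download.file() fails when the target folder is missing, one option is to create the folder before building the path. The helper name prepare_dest and the folder names are assumptions made for illustration.

```r
# Hypothetical helper: makes sure the destination folder exists and
# returns the full path of the file to be written inside it.
prepare_dest <- function(file_name, dest_dir = path.expand("~/downloaded")) {
  # recursive = TRUE creates parent folders too; showWarnings = FALSE
  # keeps the call quiet when the folder already exists
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  file.path(dest_dir, file_name)
}

# Example run against a temporary folder, so nothing is written to the
# home directory while trying this out
dest <- prepare_dest("teste.pdf", dest_dir = file.path(tempdir(), "downloaded"))
print(dest)
print(dir.exists(dirname(dest)))
```

The returned path can then be passed as the destfile argument of download.file().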

This code should do what you need:

library(tidyverse)
library(rvest)

# scraping hyperlinks
full_links <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html" %>%
               read_html() %>%
               html_nodes(".borderTD a") %>%
               html_attr("href") %>%
               xml2::url_escape() %>%
               {paste0("http://www.ouvidoriageral.sp.gov.br/", .)}

# looks at the page in Firefox
browseURL(full_links[1], encodeIfNeeded = TRUE, browser = "firefox.exe")

# saves the PDF to the "downloaded" folder if it exists;
# mode = "wb" writes the file as binary, which PDFs need on Windows
download.file(full_links[1], path.expand("~/downloaded/teste.pdf"), mode = "wb")
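Since the question mentions more than a thousand files, the same steps can be wrapped in a loop. The sketch below is one possible way to do that, not part of the original answer: download_all is a hypothetical helper, the file names are derived from the last segment of each URL, and failures are recorded instead of stopping the run. The download argument exists only so that the function can be tried without network access.

```r
# Hypothetical helper: downloads every URL in `urls` into `dest_dir` and
# returns a data frame recording which downloads succeeded.
download_all <- function(urls, dest_dir, download = utils::download.file) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)

  # Decode the last path segment so that "%20" becomes a space in the file name
  file_names <- vapply(basename(urls), utils::URLdecode, character(1),
                       USE.NAMES = FALSE)
  dest <- file.path(dest_dir, file_names)

  ok <- vapply(seq_along(urls), function(i) {
    tryCatch({
      # mode = "wb" writes binary output, which PDFs need on Windows
      status <- download(urls[i], dest[i], mode = "wb", quiet = TRUE)
      # download.file() returns 0 on success
      identical(as.integer(status), 0L)
    }, error = function(e) FALSE)
  }, logical(1))

  data.frame(url = urls, dest = dest, ok = ok, stringsAsFactors = FALSE)
}

# Possible usage with the links scraped above (not run here):
# results <- download_all(full_links, path.expand("~/downloaded"))
# results[!results$ok, ]   # links that failed and may need another attempt
```

Keeping a record of failed links makes it easier to retry only those files rather than restarting the whole batch.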

huangapple
  • Published on January 7, 2020 at 02:41:25
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59617302.html