How to download a PDF file from the web using R (encoding issue)

Question

I am trying to download a PDF file from a website using R. When I tried the function browseURL, it only worked with the argument encodeIfNeeded = T. As a result, when I pass the same URL to the function download.file, it returns "cannot open destfile 'downloaded/teste.pdf', reason 'No such file or directory'", i.e., it can't find the correct URL.

How do I fix the encoding so that I can download the file programmatically?
I need to automate this, because there are more than a thousand files to download.

Here's a minimal reproducible example:

library(tidyverse)
library(rvest)

url <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html"
webpage <- read_html(url)

# scraping hyperlinks
links_decisoes <- html_nodes(webpage, ".borderTD a") %>%
  html_attr("href")

# creating full/correct url
full_links <- paste("http://www.ouvidoriageral.sp.gov.br/", links_decisoes, sep = "")

# browseURL only works with encodeIfNeeded = T
browseURL(full_links[1], encodeIfNeeded = T,
          browser = "C://Program Files//Mozilla Firefox//firefox.exe")

# returns an error
download.file(full_links[1], "downloaded/teste.pdf")

Answer 1

Score: 3

There are a couple of problems here. Firstly, the links to some of the files are not properly formatted as URLs: they contain spaces and other special characters. To convert them you must use url_escape(), which should be available to you as loading rvest also loads xml2, which contains url_escape() (if it is not found, load xml2 explicitly).
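
As a hedged alternative sketch of the escaping step, base R's utils::URLencode() percent-encodes spaces while, by default, leaving reserved characters such as ":" and "/" untouched, so it can be applied to the complete URL. The href below is hypothetical: it stands in for a scraped link containing spaces and is not taken from the site.

```r
# Hypothetical href standing in for a scraped link that contains spaces.
href <- "arquivos/Decisao 123 - 2019.pdf"
base <- "http://www.ouvidoriageral.sp.gov.br/"

# Encode the full URL; with the default reserved = FALSE, "/" and ":" are kept.
encoded <- utils::URLencode(paste0(base, href))
print(encoded)
# "http://www.ouvidoriageral.sp.gov.br/arquivos/Decisao%20123%20-%202019.pdf"
```

Which of the two functions suits the site better depends on the characters that actually appear in the scraped hrefs, so it is worth inspecting a few encoded links before downloading in bulk.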

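Secondly, the path you are saving to is relative, and R resolves relative paths against its current working directory (see getwd()), which apparently does not contain a "downloaded" folder; that is what the "No such file or directory" error is telling you. You either need the full path to an existing folder, like this: "C://Users/Manoel/Documents/downloaded/teste.pdf", or a path under your home directory, like this: path.expand("~/downloaded/teste.pdf").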
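
A minimal sketch of making sure the destination folder exists before any download is attempted. The use of tempdir() is an assumption made only so the example can run anywhere; in a real session the folder would more likely sit under getwd() or the home directory.

```r
# Assumption for illustration: the target folder lives under tempdir().
# In a real session this could be file.path(getwd(), "downloaded").
dest_dir <- file.path(tempdir(), "downloaded")

# Create the folder if it is missing; recursive = TRUE also creates parents.
if (!dir.exists(dest_dir)) {
  dir.create(dest_dir, recursive = TRUE)
}

# Build the destination path that would be passed to download.file().
dest_file <- file.path(dest_dir, "teste.pdf")
print(dir.exists(dest_dir))
```

With the folder in place, download.file() can open the destination file instead of failing on the missing directory.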

The following code should do what you need:

library(tidyverse)
library(rvest)
library(xml2)  # provides url_escape(); loaded explicitly in case rvest does not attach it

# scraping hyperlinks
full_links <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html" %>%
  read_html() %>%
  html_nodes(".borderTD a") %>%
  html_attr("href") %>%
  url_escape() %>%
  {paste0("http://www.ouvidoriageral.sp.gov.br/", .)}

# Opens the first link in Firefox
browseURL(full_links[1], encodeIfNeeded = TRUE, browser = "firefox.exe")

# Saves the PDF to the "downloaded" folder in the home directory, if that folder exists
download.file(full_links[1], path.expand("~/downloaded/teste.pdf"))
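
Since the question mentions more than a thousand files, here is a rough sketch of how the single download above might be extended to a whole vector of links. The helper names dest_path_for() and download_all() are hypothetical, as are the one-second pause and the choice to name each file after the last path segment of its URL; mode = "wb" is used so that PDFs are written as binary, which matters on Windows.

```r
# Hypothetical helper: derive a local file name from an already encoded URL.
dest_path_for <- function(url, dest_dir) {
  file.path(dest_dir, basename(utils::URLdecode(url)))
}

# Hypothetical helper: try every URL and return TRUE/FALSE per link.
download_all <- function(urls, dest_dir) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  vapply(urls, function(u) {
    tryCatch({
      # Write in binary mode so the PDF is not altered on Windows.
      download.file(u, dest_path_for(u, dest_dir), mode = "wb", quiet = TRUE)
      Sys.sleep(1)  # assumed pause between requests, to be polite to the server
      TRUE
    }, error = function(e) {
      message("Failed: ", u, " (", conditionMessage(e), ")")
      FALSE
    })
  }, logical(1), USE.NAMES = FALSE)
}

# Usage, assuming full_links holds the encoded URLs built above:
# ok <- download_all(full_links, path.expand("~/downloaded"))
# full_links[!ok]  # links that may need a retry
```

Wrapping each download in tryCatch() keeps one broken link from stopping the whole run. Note that two links ending in the same file name would overwrite each other under this naming scheme.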
