How can I download PDFs from a website that stores them on AWS using rvest in R


Question

Problem downloading PDFs from a website that stores them on AWS using rvest

I am trying to download ~500 individual PDF submissions from this government webpage using rvest. Many of the links on the site point to PDFs stored on a separate AWS site (for example, this document - see the links from the 'Individual submissions' section onwards).

When I download the PDFs, I can't open them. I don't think I am actually downloading the linked PDFs from the AWS site. The links don't include a .pdf file extension (e.g. https://getinvolved.mdba.gov.au/22346/widgets/139364/documents/47013), so I think I'm missing a step that fetches the actual PDFs.
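
A quick way to check what one of these links actually serves is to inspect the Content-Type of the response (a diagnostic sketch using httr, added here for reference; the URL is one of the document links mentioned above):

    library(httr)

    # check one of the document links before downloading anything
    resp <- HEAD("https://getinvolved.mdba.gov.au/22346/widgets/139364/documents/47013")
    http_type(resp)    # "application/pdf" would mean the PDF is served directly;
                       # "text/html" would mean this URL is a landing page instead
    status_code(resp)  # confirm the request itself succeeded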

Here is a reproducible example:

    # load packages
    library(tidyverse)
    library(rvest)
    library(polite)
    
    # scrape PDF links and names
    mdba_NB_url <- "https://getinvolved.mdba.gov.au/bp-amendments-submissions/widgets/139364/documents"
    
    session <- bow(mdba_NB_url, force = TRUE) # from the polite package, identify and respect any explicit limits
    
    NB_page <- scrape(session) # scrape the page contents
    
    download_links <- tibble(link_names = NB_page %>% # collect the download links
                               html_nodes("a") %>%
                               html_text(),
                             link_urls = NB_page %>%
                               html_nodes("a") %>%
                               html_attr('href'))
    
    # filter PDFs
    download_links_docs <- download_links %>% # limit links to the PDFs I need
      filter(str_detect(link_names, "No. [0-9]"))
    
    download_links_docs_subset <- download_links_docs %>% # subset for a test download
      slice(c(1:10))
    
    # download PDFs
    my_urls <- download_links_docs_subset$link_urls
    save_here <- paste0(download_links_docs_subset$link_names, ".pdf")
    mapply(download.file, my_urls, save_here, mode = "wb")
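
A quick way to confirm whether the saved files are real PDFs (a diagnostic sketch added for reference, assuming the files written by the mapply() call above are in the working directory) is to look at their first bytes, since every valid PDF starts with %PDF:

    # check whether each downloaded file starts with the PDF magic bytes "%PDF"
    check_pdf <- function(path) {
      identical(rawToChar(readBin(path, "raw", n = 4L)), "%PDF")
    }
    sapply(save_here, check_pdf)  # FALSE suggests an HTML page was saved under a .pdf name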



Answer 1

Score: 0

The links are indeed redirected, but that is relatively easy to fix. If you watch the network traffic while an actual file downloads, you can see that you just need to append "/download" to your URL.

For example:

    my_urls <- paste0(download_links_docs_subset$link_urls, "/download")

You can then download them using httr; download.file seems to mess with the PDF encoding.

Like so:

    httr::GET(my_urls[1],
              httr::write_disk("test.pdf", overwrite = TRUE))
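
To extend this to all of the files rather than a single test download, a loop along these lines should work (a sketch that reuses download_links_docs_subset from the question; cleaning the link text with str_replace_all is an added assumption, since the names scraped from the page may contain characters that are not valid in file names):

    library(httr)
    library(stringr)

    # append "/download" so the hosted file is returned rather than the landing page
    my_urls <- paste0(download_links_docs_subset$link_urls, "/download")

    # build file names from the link text, replacing characters that are not
    # safe in file names (this cleaning step is an assumption, not from the answer)
    save_here <- paste0(
      str_replace_all(download_links_docs_subset$link_names, "[^A-Za-z0-9 ._-]", "_"),
      ".pdf"
    )

    # download each PDF with httr, pausing between requests to stay polite to the server
    for (i in seq_along(my_urls)) {
      GET(my_urls[i], write_disk(save_here[i], overwrite = TRUE))
      Sys.sleep(5)
    }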
