How can I download PDFs from a website that stores them on AWS using rvest in R


Question

Problem downloading PDFs from a website that stores them on AWS using rvest

I am trying to download ~500 individual PDF submissions from this government webpage using rvest. Many of the links on the site point to PDFs stored on a separate AWS site (for example this document - see the links from the 'Individual submissions' section onwards).

When I download the PDFs, I can't open them. I don't think I am actually downloading the linked PDFs from the AWS site. The links don't include a .pdf file extension (e.g. https://getinvolved.mdba.gov.au/22346/widgets/139364/documents/47013), and I think I'm missing a step that fetches the actual PDFs.

Here is a reproducible example:

    # load packages
    library(tidyverse)
    library(rvest)
    library(polite)
    
    # scrape PDF links and names
    mdba_NB_url <- "https://getinvolved.mdba.gov.au/bp-amendments-submissions/widgets/139364/documents"
    
    session <- bow(mdba_NB_url, force = TRUE) # from the polite package: identify ourselves and respect any explicit limits
    
    NB_page <- scrape(session) # scrape the page contents
    
    download_links <- tibble(link_names = NB_page %>% # download links
                               html_nodes("a") %>%
                               html_text(),
                             link_urls = NB_page %>%
                               html_nodes("a") %>%
                               html_attr("href"))
    
    # filter to the PDFs I need
    download_links_docs <- download_links %>%
      filter(str_detect(link_names, "No. [0-9]"))
    
    download_links_docs_subset <- download_links_docs %>% # subset for a test download
      slice(1:10)
    
    # download the PDFs
    my_urls <- download_links_docs_subset$link_urls
    save_here <- paste0(download_links_docs_subset$link_names, ".pdf")
    mapply(download.file, my_urls, save_here, mode = "wb")
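
As a quick check on what actually got saved, the first few bytes of one of the downloaded files can be inspected; a genuine PDF begins with "%PDF" (a minimal diagnostic sketch, assuming the mapply() call above has already written save_here[1]):

    # inspect the first four bytes of the first downloaded file:
    # "%PDF" indicates a real PDF; anything else (e.g. "<!DO" for an HTML page)
    # means the saved file is not the PDF itself
    rawToChar(readBin(save_here[1], what = "raw", n = 4))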

Answer 1

Score: 0

The link is indeed redirected. You can fix it fairly easily, though: if you watch the network traffic while an actual file downloads, you can see that you just need to append "/download" to your URL.

For example:

    my_urls <- paste0(download_links_docs_subset$link_urls, "/download")

You can then download them with httr; download.file seems to mess with the PDF encoding.

Like so:

    httr::GET(my_urls[1],
              httr::write_disk("test.pdf", overwrite = TRUE))
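
To grab all of the files at once, the two steps can be combined into a loop; the sketch below is one way to do that (the file-name sanitisation pattern and the length of the pause between requests are assumptions on my part, adjust them as needed):

    library(httr)
    library(stringr)
    
    # append "/download" to each scraped URL to reach the actual PDF on AWS
    pdf_urls <- paste0(download_links_docs_subset$link_urls, "/download")
    
    # build file names from the link text, replacing characters that are not
    # safe in file names (assumed pattern)
    pdf_files <- paste0(
      str_replace_all(download_links_docs_subset$link_names, "[^A-Za-z0-9 ._-]", "_"),
      ".pdf"
    )
    
    # download each PDF, pausing briefly between requests to stay polite
    for (i in seq_along(pdf_urls)) {
      httr::GET(pdf_urls[i], httr::write_disk(pdf_files[i], overwrite = TRUE))
      Sys.sleep(5)
    }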
