How can I download PDFs from a website that stores them on AWS using rvest in R
Question
Problem downloading PDFs from a website that stores them on AWS using rvest
I am trying to download ~500 individual PDF submissions from this government webpage using rvest. Many of the links on the site point to PDFs stored on a separate AWS site (for example this document - see links from the 'Individual submissions' section onwards).
When I download the PDFs, I can't open them. I don't think I am actually downloading the linked PDFs from the AWS site. The links don't include a .pdf file extension (e.g. https://getinvolved.mdba.gov.au/22346/widgets/139364/documents/47013), and I think I'm missing a step to download the actual PDFs.
Here is a reproducible example:
#load packages
library(tidyverse)
library(rvest)
library(polite)
# scrape PDF links and names
mdba_NB_url <- "https://getinvolved.mdba.gov.au/bp-amendments-submissions/widgets/139364/documents"
session <- bow(mdba_NB_url, force = TRUE) # from the polite package, identify and respect any explicit limits
NB_page <- scrape(session) # scrape the page contents
download_links <- tibble(link_names = NB_page %>% # download links
                           html_nodes("a") %>%
                           html_text(),
                         link_urls = NB_page %>%
                           html_nodes("a") %>%
                           html_attr('href'))
# filter PDFs
download_links_docs <- download_links %>% # limit links to the PDFs I need
  filter(str_detect(link_names, "No. [0-9]"))
download_links_docs_subset <- download_links_docs %>% # subset for test download
  slice(c(1:10))
# Download PDFs
my_urls <- download_links_docs_subset$link_urls
save_here <- paste0(download_links_docs_subset$link_names, ".pdf")
mapply(download.file, my_urls, save_here, mode = "wb")
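A quick way to confirm whether the files saved above are real PDFs is to inspect their first bytes; a valid PDF begins with "%PDF". A minimal check in base R, assuming at least one of the save_here files exists on disk:
readBin(save_here[1], what = "raw", n = 4)            # prints the bytes 25 50 44 46 for a real PDF
rawToChar(readBin(save_here[1], what = "raw", n = 4)) # "%PDF" for a real PDF; an HTML error page starts differently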
Answer 1
Score: 0
The links are indeed redirected, but you can fix this relatively easily. If you look at the network analysis while an actual file downloads, you can see that you just need to append "/download" to your URL.
For example:
my_urls <- paste0(download_links_docs_subset$link_urls, "/download")
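The example link from the question, https://getinvolved.mdba.gov.au/22346/widgets/139364/documents/47013, would then become https://getinvolved.mdba.gov.au/22346/widgets/139364/documents/47013/download.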
You can then download them using httr
. download.file
seems to mess with the PDF encoding.
Like so:
httr::GET(my_urls[1],
          httr::write_disk("test.pdf", overwrite = TRUE))
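To fetch the full set rather than a single file, one possible sketch (assuming my_urls already has "/download" appended and save_here holds the file names from the question's code; purrr is loaded with the tidyverse) is to loop over the pairs and pause briefly between requests:
purrr::walk2(my_urls, save_here, function(url, dest) {
  httr::GET(url, httr::write_disk(dest, overwrite = TRUE)) # save each response straight to disk
  Sys.sleep(1)                                             # small pause between requests to stay polite to the server
})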
Comments