Finding a Substring in a Link
Question
So in my Python function, I pass in a URL, search for PDF files on that URL, and then download those files. For most cases it works perfectly.
# imports implied by the code below (urllib2 is the Python 2 module)
import urllib2
from bs4 import BeautifulSoup
import wget

def get_pdfs(my_url):
    html = urllib2.urlopen(my_url).read()
    html_page = BeautifulSoup(html)
    current_link = ''
    links = []
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        if current_link.endswith('pdf'):
            print(current_link)
            links.append(my_url + current_link)
    #print(links)
    for link in links:
        #urlretrieve(link)
        wget.download(link)

get_pdfs('https://grader.eecs.jacobs-university.de/courses/320241/2019_2/')
However, when I try using my function on a particular course website, my current_link is

/courses/320241/2019_2/lectures/lecture_7_8.pdf

though it should be detected automatically and should only be

lectures/lecture_7_8.pdf

while the original my_url that I passed to the function was

https://grader.eecs.jacobs-university.de/courses/320241/2019_2/

Since I'm appending the two together and a part of the link is repeated, the downloaded files are corrupted. How can I check whether any part of current_link is repeated from my_url, and if so, how can I remove it before downloading?
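For illustration, this is roughly what the plain concatenation produces for the values quoted above (a minimal sketch using the URLs from the question):

# Minimal sketch: the href on this site is server-relative (it starts
# with "/"), so plain string concatenation repeats the
# /courses/320241/2019_2/ part of the path.
my_url = 'https://grader.eecs.jacobs-university.de/courses/320241/2019_2/'
current_link = '/courses/320241/2019_2/lectures/lecture_7_8.pdf'

print(my_url + current_link)
# https://grader.eecs.jacobs-university.de/courses/320241/2019_2//courses/320241/2019_2/lectures/lecture_7_8.pdf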
Answer 1
Score: 1
Update: using urljoin from urllib.parse will do the job:
import urllib.parse
from urllib.request import urlopen
from bs4 import BeautifulSoup
import wget

def get_pdfs(my_url):
    html = urlopen(my_url).read()
    html_page = BeautifulSoup(html, 'html.parser')
    current_link = ''
    links = []
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        if current_link.endswith('pdf'):
            print(current_link)
            # urljoin resolves relative and server-relative hrefs
            # against my_url instead of blindly concatenating them
            links.append(urllib.parse.urljoin(my_url, current_link))
    for link in links:
        wget.download(link)
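For reference, a quick sketch (using the base URL and href from the question) of what urljoin returns, which shows why the path is no longer duplicated:

from urllib.parse import urljoin

base = 'https://grader.eecs.jacobs-university.de/courses/320241/2019_2/'

# A server-relative href (leading slash) is resolved against the host:
print(urljoin(base, '/courses/320241/2019_2/lectures/lecture_7_8.pdf'))
# https://grader.eecs.jacobs-university.de/courses/320241/2019_2/lectures/lecture_7_8.pdf

# A plain relative href is resolved against the base path:
print(urljoin(base, 'lectures/lecture_7_8.pdf'))
# https://grader.eecs.jacobs-university.de/courses/320241/2019_2/lectures/lecture_7_8.pdf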
Simplified method: .select('a[href$=pdf]') selects all links whose href ends with pdf:
import urllib.parse
from urllib.request import urlopen
from bs4 import BeautifulSoup
import wget

def get_pdfs(my_url):
    html = urlopen(my_url).read()
    html_page = BeautifulSoup(html, 'html.parser')
    [wget.download(urllib.parse.urljoin(my_url, link.get('href'))) for link in html_page.select('a[href$=pdf]')]
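To show what the attribute selector matches, here is a small self-contained check on a made-up HTML snippet (the markup is hypothetical, not taken from the course page):

from bs4 import BeautifulSoup

# Hypothetical markup, just to illustrate the a[href$=pdf] selector
html = '<a href="notes.txt">notes</a><a href="/courses/320241/2019_2/lectures/lecture_7_8.pdf">slides</a>'
page = BeautifulSoup(html, 'html.parser')

print([a.get('href') for a in page.select('a[href$=pdf]')])
# ['/courses/320241/2019_2/lectures/lecture_7_8.pdf']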