Finding a Substring in A Link

Question

So in my Python function, I pass in a URL, search for PDF files at that URL, and then download those files. In most cases it works perfectly.

    import urllib2  # Python 2; on Python 3 this would be urllib.request
    from bs4 import BeautifulSoup
    import wget

    def get_pdfs(my_url):
        html = urllib2.urlopen(my_url).read()
        html_page = BeautifulSoup(html)
        current_link = ''
        links = []
        for link in html_page.find_all('a'):
            current_link = link.get('href')
            if current_link.endswith('pdf'):
                print(current_link)
                # plain concatenation of the page URL and the href
                links.append(my_url + current_link)
        #print(links)
        for link in links:
            #urlretrieve(link)
            wget.download(link)

    get_pdfs('https://grader.eecs.jacobs-university.de/courses/320241/2019_2/')

However, when I try using my function on a particular course website, my current_link is

    /courses/320241/2019_2/lectures/lecture_7_8.pdf

though it should be automatically detected and should only be

    lectures/lecture_7_8.pdf

while the original my_url that I passed to the function was

    https://grader.eecs.jacobs-university.de/courses/320241/2019_2/

Since I'm appending both of them and a part of the link is repeated, the downloaded files are corrupted. How can I check whether any part of current_link is repeated from my_url, and if so, how can I remove it before downloading?
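
To make the failure concrete, here is a minimal sketch using only the two values shown above: plain concatenation doubles the /courses/320241/2019_2/ part of the path.

    my_url = 'https://grader.eecs.jacobs-university.de/courses/320241/2019_2/'
    current_link = '/courses/320241/2019_2/lectures/lecture_7_8.pdf'

    # naive join, as in links.append(my_url + current_link)
    print(my_url + current_link)
    # https://grader.eecs.jacobs-university.de/courses/320241/2019_2//courses/320241/2019_2/lectures/lecture_7_8.pdf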

Answer 1

Score: 1

Update: using urljoin from urllib.parse will do the job:

    import urllib.parse
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import wget

    def get_pdfs(my_url):
        html = urlopen(my_url).read()
        html_page = BeautifulSoup(html, 'html.parser')
        current_link = ''
        links = []
        for link in html_page.find_all('a'):
            current_link = link.get('href')
            if current_link.endswith('pdf'):
                print(current_link)
                # urljoin resolves relative and root-relative hrefs against my_url
                links.append(urllib.parse.urljoin(my_url, current_link))
        for link in links:
            wget.download(link)
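
For completeness, the function can then be called exactly as in the question, with the same course URL:

    get_pdfs('https://grader.eecs.jacobs-university.de/courses/320241/2019_2/')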

Simplified method: .select('a[href$=pdf]') selects all links whose href ends with pdf:

    import urllib.parse
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import wget

    def get_pdfs(my_url):
        html = urlopen(my_url).read()
        html_page = BeautifulSoup(html, 'html.parser')
        # one pass: select every <a> whose href ends with "pdf" and download it
        [wget.download(urllib.parse.urljoin(my_url, link.get('href')))
         for link in html_page.select('a[href$=pdf]')]
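
As a quick check of why urljoin fixes the duplication, here is a small sketch; the first two hrefs are the ones from the question, the third is a made-up absolute URL. urljoin resolves root-relative and relative hrefs against my_url instead of blindly prepending it, and leaves absolute URLs untouched:

    from urllib.parse import urljoin

    base = 'https://grader.eecs.jacobs-university.de/courses/320241/2019_2/'

    # root-relative href, as served by the course page in the question
    print(urljoin(base, '/courses/320241/2019_2/lectures/lecture_7_8.pdf'))
    # -> https://grader.eecs.jacobs-university.de/courses/320241/2019_2/lectures/lecture_7_8.pdf

    # plain relative href
    print(urljoin(base, 'lectures/lecture_7_8.pdf'))
    # -> https://grader.eecs.jacobs-university.de/courses/320241/2019_2/lectures/lecture_7_8.pdf

    # absolute href (hypothetical) is returned as-is
    print(urljoin(base, 'https://example.com/other.pdf'))
    # -> https://example.com/other.pdf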