Finding a Substring in a Link

Question

In my Python function, I pass in a URL, search for PDF files on that URL, and then download those files. In most cases, it works perfectly.

import urllib2
import wget
from bs4 import BeautifulSoup

def get_pdfs(my_url):
    html = urllib2.urlopen(my_url).read()
    html_page = BeautifulSoup(html)
    current_link = ''
    links = []
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        if current_link.endswith('pdf'):
            print(current_link)
            # join base URL and href by plain string concatenation
            links.append(my_url + current_link)
    #print(links)

    for link in links:
        #urlretrieve(link)
        wget.download(link)


get_pdfs('https://grader.eecs.jacobs-university.de/courses/320241/2019_2/')

However, when I try using my function on a particular course website, my current_link is:

/courses/320241/2019_2/lectures/lecture_7_8.pdf

though it should be detected automatically and should only be:

lectures/lecture_7_8.pdf

while the original my_url that I passed to the function was:

https://grader.eecs.jacobs-university.de/courses/320241/2019_2/

since I'm appending both of them & a part of the link is repeated, the files downloaded are corrupted. How can I check current_link if any part is repeated from my_url and if yes, how can I remove it before downloading?

Answer 1

Score: 1

Updating your code to use urljoin from urllib.parse will do the job:

import urllib.parse
from urllib.request import urlopen
from bs4 import BeautifulSoup
import wget

def get_pdfs(my_url):
    html = urlopen(my_url).read()
    html_page = BeautifulSoup(html, 'html.parser')
    current_link = ''
    links = []
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        if current_link.endswith('pdf'):
            print(current_link)
            # urljoin resolves both relative and absolute-path hrefs
            # against my_url, so no part of the path is duplicated
            links.append(urllib.parse.urljoin(my_url, current_link))

    for link in links:
        wget.download(link)
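
This works because urljoin resolves the href against my_url whether it is relative or an absolute path; a quick illustration with the URLs from the question:

from urllib.parse import urljoin

base = 'https://grader.eecs.jacobs-university.de/courses/320241/2019_2/'

# Absolute-path href: it replaces the base path, so nothing is duplicated.
print(urljoin(base, '/courses/320241/2019_2/lectures/lecture_7_8.pdf'))
# https://grader.eecs.jacobs-university.de/courses/320241/2019_2/lectures/lecture_7_8.pdf

# Relative href: resolved against the base directory.
print(urljoin(base, 'lectures/lecture_7_8.pdf'))
# https://grader.eecs.jacobs-university.de/courses/320241/2019_2/lectures/lecture_7_8.pdf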

Simplified method: .select('a[href$=pdf]') selects all links whose href ends with pdf:

import urllib.parse
from urllib.request import urlopen
from bs4 import BeautifulSoup
import wget

def get_pdfs(my_url):
    html = urlopen(my_url).read()
    html_page = BeautifulSoup(html, 'html.parser')
    # download every matching PDF link, resolved against my_url
    [wget.download(urllib.parse.urljoin(my_url, link.get('href'))) for link in html_page.select('a[href$=pdf]')]
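
Either version is called the same way as in the question; by default wget.download writes each file to the current working directory:

get_pdfs('https://grader.eecs.jacobs-university.de/courses/320241/2019_2/')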