Finding a Substring in A Link

Question

In my Python function, I pass in a URL, search that page for PDF files, and then download those files. For most cases, it works perfectly.

import urllib2                  # Python 2 HTTP client
from bs4 import BeautifulSoup   # HTML parsing
import wget                     # simple file downloader

def get_pdfs(my_url):
    html = urllib2.urlopen(my_url).read()
    html_page = BeautifulSoup(html)
    current_link = ''
    links = []
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        if current_link.endswith('pdf'):
            print(current_link)
            # naive concatenation of the base URL and the href
            links.append(my_url + current_link)
    #print(links)

    for link in links:
        #urlretrieve(link)
        wget.download(link)


get_pdfs('https://grader.eecs.jacobs-university.de/courses/320241/2019_2/')

However, when I try using my function on one particular course website, my current_link is

/courses/320241/2019_2/lectures/lecture_7_8.pdf

though it should be detected automatically and should only be

lectures/lecture_7_8.pdf

while the original my_url that I passed to the function was

https://grader.eecs.jacobs-university.de/courses/320241/2019_2/

Since I'm appending the two together, part of the link is repeated and the downloaded files are corrupted. How can I check whether any part of current_link is repeated from my_url and, if so, remove it before downloading?
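
For reference, here is a minimal sketch of the failure mode, using only the two values quoted above (it assumes the href really is the server-root-relative path shown):

my_url = 'https://grader.eecs.jacobs-university.de/courses/320241/2019_2/'
current_link = '/courses/320241/2019_2/lectures/lecture_7_8.pdf'

# Plain string concatenation repeats the /courses/320241/2019_2/ segment:
print(my_url + current_link)
# https://grader.eecs.jacobs-university.de/courses/320241/2019_2//courses/320241/2019_2/lectures/lecture_7_8.pdf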

Answer 1

Score: 1

Update: using urljoin from urllib.parse will do the job:

import urllib.parse
from urllib.request import urlopen
from bs4 import BeautifulSoup
import wget

def get_pdfs(my_url):
    html = urlopen(my_url).read()
    html_page = BeautifulSoup(html, 'html.parser')
    current_link = ''
    links = []
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        if current_link.endswith('pdf'):
            print(current_link)
            links.append(urllib.parse.urljoin(my_url, current_link))
    
    for link in links:
        wget.download(link)
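
To see why this fixes the duplication, here is a quick check (a sketch reusing the URLs from the question) showing that urljoin resolves both forms of href to the same absolute URL:

from urllib.parse import urljoin

base = 'https://grader.eecs.jacobs-university.de/courses/320241/2019_2/'

# href given as a server-root-relative path (what this course page serves)
print(urljoin(base, '/courses/320241/2019_2/lectures/lecture_7_8.pdf'))
# https://grader.eecs.jacobs-university.de/courses/320241/2019_2/lectures/lecture_7_8.pdf

# href given relative to the page itself
print(urljoin(base, 'lectures/lecture_7_8.pdf'))
# https://grader.eecs.jacobs-university.de/courses/320241/2019_2/lectures/lecture_7_8.pdf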

Simplified method: .select('a[href$=pdf]') selects all links whose href ends with pdf:

import urllib.parse
from urllib.request import urlopen
from bs4 import BeautifulSoup
import wget

def get_pdfs(my_url):
    html = urlopen(my_url).read()
    html_page = BeautifulSoup(html, 'html.parser')
    [wget.download(urllib.parse.urljoin(my_url, link.get('href'))) for link in html_page.select('a[href$=pdf]')]
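
The simplified version is called the same way as in the question; the list comprehension is used only for its download side effect, so an ordinary for loop would work just as well:

get_pdfs('https://grader.eecs.jacobs-university.de/courses/320241/2019_2/')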
