How to scrape all the links from a webpage while scrolling down

Question


I am extracting news items from a certain category page of a website on which there is no load-more button; instead, links to news stories are produced as I scroll down. I created a function that accepts a category page URL and a limit page (the number of times I want to scroll down) as inputs and returns all the links to the news items shown on that page. It was running properly earlier, but recently I found that on each page the news links are in different classes, which makes it very difficult to retrieve them all together. Maybe I'm wrong, and I do believe there would be a simpler method.

Category page link = https://www.scmp.com/topics/currencies

This was my try!

    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


    def get_article_links(url, limit_loading):

        options = webdriver.ChromeOptions()

        # Defined but never passed to the driver below.
        caps = DesiredCapabilities().CHROME
        caps["pageLoadStrategy"] = "normal"

        options.add_argument("--window-size=1920,1080")
        options.add_argument("--disable-extensions")
        options.add_argument("--disable-notifications")
        options.add_argument("--disable-popup-blocking")

        driver = webdriver.Chrome(executable_path=r"E:\chromedriver\chromedriver.exe", options=options)  # add your chromedriver path

        driver.get(url)
        last_height = driver.execute_script("return document.body.scrollHeight")  # unused

        loading = 0
        end_div = driver.find_element('class name', 'topic-content__load-more-anchor')
        while loading < limit_loading:
            loading += 1
            print(f'scrolling to page {loading}...')
            # Reading this property scrolls the element into view as a side effect.
            end_div.location_once_scrolled_into_view
            time.sleep(2)

        article_links = []
        bsObj = BeautifulSoup(driver.page_source, 'html.parser')
        for i in bsObj.find('div', {'class': 'content-box'}).find('div', {'class': 'topic-article-container'}).find_all('h2', {'class': 'article__title'}):
            article_links.append(i.a['href'])

        return article_links

Assuming I want to scroll 3 times on this category page and get back all the links loaded by those scrolls:

    get_article_links('https://www.scmp.com/topics/currencies', 3)

But it is neither scrolling nor getting me back the links, which is the problem I'm facing. Any help with this will be really appreciated. Thanks~~

Answer 1

Score: 1


This one took a while to solve!

Apparently there is no way of scrolling to the bottom of the page if you don't click it first. And it's also much easier to just send the END key.
I got the idea of sending keys from here (https://stackoverflow.com/questions/27775759/send-keys-control-click-in-selenium-with-python-bindings).

After scrolling, look for the div that contains all the articles; its class is 'css-1279nek ef7usvb8'.
Then use a regex to match every href, keeping only the page address. Lastly, use set() to get rid of duplicates.

    # Also requires:
    #   import re, time
    #   from selenium.webdriver.common.action_chains import ActionChains
    #   from selenium.webdriver.common.keys import Keys
    driver.get(url)
    elem = driver.find_element('xpath', "//body")  # find_element_by_xpath was removed in Selenium 4
    for i in range(limit_loading):
        # Click the page so it has keyboard focus, then press END to jump
        # to the bottom and trigger the next batch of articles.
        action = ActionChains(driver)
        action.click().perform()
        elem.send_keys(Keys.END)
        time.sleep(5)

    bsObj = BeautifulSoup(driver.page_source, 'html.parser')
    article_div = bsObj.find('div', {'class': "css-1279nek ef7usvb8"})
    # Capture hrefs made of letters, digits, '/', '.' and '-' only.
    article_links = list(set(re.findall(r'href="([/A-Za-z0-9.-]+)', str(article_div))))
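
For reference, here is a minimal self-contained sketch that puts the question's setup and this answer's click-and-END-key scrolling together. It assumes Selenium 4.6+ (which locates chromedriver automatically, so no executable path is needed) and that the article container still uses the auto-generated class 'css-1279nek ef7usvb8', which may well have changed since this was written.

    import re
    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.keys import Keys


    def get_article_links(url, limit_loading):
        options = webdriver.ChromeOptions()
        options.add_argument("--window-size=1920,1080")
        options.add_argument("--disable-notifications")

        # Selenium 4.6+ resolves the chromedriver binary automatically.
        driver = webdriver.Chrome(options=options)
        try:
            driver.get(url)
            body = driver.find_element('xpath', '//body')

            for _ in range(limit_loading):
                # Click once so the page has keyboard focus, then press END
                # to jump to the bottom and trigger the next batch of articles.
                ActionChains(driver).click().perform()
                body.send_keys(Keys.END)
                time.sleep(5)

            soup = BeautifulSoup(driver.page_source, 'html.parser')
            container = soup.find('div', {'class': 'css-1279nek ef7usvb8'})
            # str(None) yields no matches, so a missing container returns [].
            return sorted(set(re.findall(r'href="([/A-Za-z0-9.-]+)"', str(container))))
        finally:
            driver.quit()


    print(get_article_links('https://www.scmp.com/topics/currencies', 3))

Wrapping the work in try/finally guarantees the browser is closed even if the page structure has changed and the parsing step raises.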
