How to scrape all the links from a webpage while scrolling down

Question


I am extracting news items from a certain category page of a website on which there is no load-more button; instead, links to news stories are produced as I scroll down. I created a function that accepts a category page URL and a limit page (the number of times I want to scroll down) as inputs and returns all the links to the news items shown on that page. It was running properly earlier, but recently I found that on each page the news links are in different classes, which makes it very difficult to retrieve them all together. Maybe I'm wrong, and I do believe there would be a simpler method.

Category page link = https://www.scmp.com/topics/currencies

This was my try!

    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


    def get_article_links(url, limit_loading):

        options = webdriver.ChromeOptions()

        # Defined but never passed to the driver below.
        caps = DesiredCapabilities().CHROME
        caps["pageLoadStrategy"] = "normal"

        options.add_argument("--window-size=1920,1080")
        options.add_argument("--disable-extensions")
        options.add_argument("--disable-notifications")
        options.add_argument("--disable-popup-blocking")

        driver = webdriver.Chrome(executable_path=r"E:\chromedriver\chromedriver.exe", options=options)  # add your chromedriver path

        driver.get(url)
        last_height = driver.execute_script("return document.body.scrollHeight")  # unused

        loading = 0
        end_div = driver.find_element('class name', 'topic-content__load-more-anchor')
        while loading < limit_loading:
            loading += 1
            print(f'scrolling to page {loading}...')
            # Reading this property scrolls the element into view as a side effect.
            end_div.location_once_scrolled_into_view
            time.sleep(2)

        article_links = []
        bsObj = BeautifulSoup(driver.page_source, 'html.parser')
        for i in bsObj.find('div', {'class': 'content-box'}).find('div', {'class': 'topic-article-container'}).find_all('h2', {'class': 'article__title'}):
            article_links.append(i.a['href'])

        return article_links

Assuming I want to scroll 3 times on this category page and get back all the links loaded by those scrolls:

    get_article_links('https://www.scmp.com/topics/currencies', 3)

But it is neither scrolling nor getting me back the links, which is the problem I'm facing. Any help with this will be really appreciated. Thanks~~

Answer 1

Score: 1


This one took a while to solve!

Apparently there is no way of scrolling to the bottom of the page if you don't click it first. And it's also much easier to just send the END key.
I got the idea of sending keys from here (https://stackoverflow.com/questions/27775759/send-keys-control-click-in-selenium-with-python-bindings).

After scrolling, look for the div that contains all the articles; its class is 'css-1279nek ef7usvb8'.
Then use a regex to match every href, keeping only the page address. Lastly, use set() to get rid of duplicates.

    # Also requires:
    #   import re, time
    #   from selenium.webdriver.common.action_chains import ActionChains
    #   from selenium.webdriver.common.keys import Keys
    driver.get(url)
    elem = driver.find_element('xpath', "//body")  # find_element_by_xpath was removed in Selenium 4
    for i in range(limit_loading):
        # Click the page so it has keyboard focus, then press END to jump
        # to the bottom and trigger the next batch of articles.
        action = ActionChains(driver)
        action.click().perform()
        elem.send_keys(Keys.END)
        time.sleep(5)

    bsObj = BeautifulSoup(driver.page_source, 'html.parser')
    article_div = bsObj.find('div', {'class': "css-1279nek ef7usvb8"})
    # Capture hrefs made of letters, digits, '/', '.' and '-' only.
    article_links = list(set(re.findall(r'href="([/A-Za-z0-9.-]+)', str(article_div))))
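
For reference, here is a minimal self-contained sketch that puts the question's setup and this answer's click-and-END-key scrolling together. It assumes Selenium 4.6+ (which locates chromedriver automatically, so no executable path is needed) and that the article container still uses the auto-generated class 'css-1279nek ef7usvb8', which may well have changed since this was written.

    import re
    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.keys import Keys


    def get_article_links(url, limit_loading):
        options = webdriver.ChromeOptions()
        options.add_argument("--window-size=1920,1080")
        options.add_argument("--disable-notifications")

        # Selenium 4.6+ resolves the chromedriver binary automatically.
        driver = webdriver.Chrome(options=options)
        try:
            driver.get(url)
            body = driver.find_element('xpath', '//body')

            for _ in range(limit_loading):
                # Click once so the page has keyboard focus, then press END
                # to jump to the bottom and trigger the next batch of articles.
                ActionChains(driver).click().perform()
                body.send_keys(Keys.END)
                time.sleep(5)

            soup = BeautifulSoup(driver.page_source, 'html.parser')
            container = soup.find('div', {'class': 'css-1279nek ef7usvb8'})
            # str(None) yields no matches, so a missing container returns [].
            return sorted(set(re.findall(r'href="([/A-Za-z0-9.-]+)"', str(container))))
        finally:
            driver.quit()


    print(get_article_links('https://www.scmp.com/topics/currencies', 3))

Wrapping the work in try/finally guarantees the browser is closed even if the page structure has changed and the parsing step raises.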
