Web Scraping News Articles Python

Question

I need to scrape this website: https://www.rbi.org.in/scripts/NewLinkDetails.aspx

It contains news from the central bank of India. We need to use Playwright for Python with asyncio.

The HTML pattern of the page is the following. Each of these links contains the URL we need to visit to get the news:

<a class="link2" href="https://www.rbi.org.in/Scripts/BS_PressReleaseDisplay.aspx?prid=56032">Governor, Reserve Bank of India meets MD & CEOs of Public and Private Sector Banks</a>

For example, if we go to https://www.rbi.org.in/Scripts/BS_PressReleaseDisplay.aspx?prid=56032, the HTML structure is the following.

Here the tableheader cell holds the news title, which we need to get:

<td align="center" class="tableheader"><b>Governor, Reserve Bank of India meets MD & CEOs of Public and Private Sector Banks</b></td>

From this HTML pattern we need to get the date:

<td align="right" class="tableheader"><b> Date : </b>Jul 11, 2023</td>

From this HTML pattern we can extract the news content. Each p tag contains a paragraph of the article, so we need to get all p tags from each URL:

<tr class="tablecontent1"><td><table width="100%" border="0" align="center" class="td">  <tbody><tr>    
<td><p>The Governor, Reserve Bank of India held meetings with the MD & CEOs of Public Sector Banks and select Private Sector Banks on July 11, 2023 at Mumbai. 
The meetings were also attended by Deputy Governors, Shri M. Rajeshwar Rao and Shri Swaminathan J., along with a few senior officials of the RBI. </p>     
<p>The Governor in his introductory remarks, while noting the good performance of the Indian banking system despite various adverse global developments.</p>    
<p>The issues relating to strengthening of credit underwriting standards, monitoring of large exposures, implementation of External Benchmark Linked Rate (EBLR) Guidelines,
bolstering IT security and IT governance, improving recovery from written-off accounts, and timely and accurate sharing of information with Credit Information Companies 
were discussed.</p>     
<p align="right"><span class="head">(Yogesh Dayal)     </span><br>      Chief General Manager</p>    
<p class="head">Press Release: 2023-2024/582</p></td>  </tr></tbody></table></td> </tr>

I am using this code:

import asyncio
from playwright.async_api import async_playwright

async def scrape_rbi_news():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context()

        page = await context.new_page()
        await page.goto('https://www.rbi.org.in/scripts/NewLinkDetails.aspx')

        # Wait for the page to load and display the links
        await page.wait_for_selector('.link2')

        # Get all news links
        news_links = await page.query_selector_all('.link2')

        # Get the first 10 news links
        top_10_links = news_links[:10]

        for link in top_10_links:
            link_url = await link.get_attribute('href')

            # Open each news link
            await page.goto(link_url)
            await asyncio.sleep(2)  # Add a delay of 2 seconds for the page to load

            try:
                # Wait for the title and date elements to be attached to the DOM
                await page.wait_for_selector('.tableheader b', timeout=5000)
                await page.wait_for_selector('.tableheader b:has-text(" Date : ")', timeout=5000)

                # Extract news date using JavaScript evaluation
                news_date_element = await page.query_selector('.tableheader b:has-text(" Date : ")')
                news_date = await news_date_element.evaluate('(element) => element.nextSibling.textContent')

                # Extract news content
                news_content_elements = await page.query_selector_all('.tablecontent1 p')
                news_content = '\n'.join([await element.inner_text() for element in news_content_elements])

                # Print extracted data for each news article
                print('URL:', link_url)
                print('Date:', news_date.strip())
                print('Content:', news_content)
                print('---')
            except Exception as e:
                print('Error:', str(e))

        await browser.close()

# Run the scraping function
if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(scrape_rbi_news())

It prints the first news item correctly. After that it breaks, and I see this error:

playwright._impl._api_types.Error: Element is not attached to the DOM

Any suggestions on how to solve this issue?

Answer 1

Score: 0

Your problem is in the line link_url = await link.get_attribute('href')

You are on the index page, you get the href attribute of each link, and you navigate to that new URL.

Once you are on the news page, the next loop iteration tries link_url = await link.get_attribute('href') again, but that element is no longer in the page, so you cannot get the href of an element that does not exist.

You should save the links into a list before starting the loop, as sketched below.
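A minimal sketch of just that change (plain strings survive navigation; element handles do not):

# Resolve every href to plain text while still on the index page...
hrefs = []
for link_element in await page.query_selector_all('.link2'):
    hrefs.append(await link_element.get_attribute('href'))

# ...then navigate; the saved strings stay valid across page loads
for url in hrefs[:10]:
    await page.goto(url)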

Here is your script after that change (I wrote my own selectors; I am not very familiar with CSS, so I used XPath):

import asyncio
from playwright.async_api import async_playwright

async def scrape_rbi_news():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()

        page = await context.new_page()
        await page.goto('https://www.rbi.org.in/scripts/NewLinkDetails.aspx')

        # Wait for the page to load and display the links
        await page.wait_for_selector('.link2')

        # Get all news links
        news_links = await page.locator('.link2').all()

        # Get the first 10 news links
        top_10_links = news_links[:10]
        links = []
        # Here we are going to save the links as text, instead of elements,
        # in order to avoid the problem I commented before
        for link_element in top_10_links:
            links.append(await link_element.get_attribute('href'))

        for link in links:
            # Open each news link
            await page.goto(link)
            await asyncio.sleep(2)  # Add a delay of 2 seconds for the page to load

            try:
                # Wait for the title and date elements to be attached to the DOM
                date = await page.locator("(//td[@class='tableheader'])[2]").inner_text()
                title = await page.locator("(//td[@class='tableheader']/b)[2]").inner_text()
                content = await page.locator("//tr[@class='tablecontent1']//p").all_inner_texts()
                content = '\n\n'.join(content)

                print('URL:', link)
                print(date)
                print(title)
                print('Content:', content)
                print('---')
            except Exception as e:
                print('Error:', str(e))

        await browser.close()

# Run the scraping function
if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(scrape_rbi_news())
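As a side note, on Python 3.7+ the event-loop boilerplate at the bottom can be replaced with asyncio.run:

# Equivalent, more modern entry point
if __name__ == '__main__':
    asyncio.run(scrape_rbi_news())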

huangapple
  • Published on August 9, 2023, 17:21:48
  • Please keep a link to the original when reposting: https://go.coder-hub.com/76866299.html