How to iteratively retrieve the right information from Beautiful Soup elements?

Question

I am trying to retrieve information from EZB press releases using BeautifulSoup. Because the HTML structure of the press releases has changed over time, it is difficult to retrieve the date of a press release with a single selector. I therefore tried combining try/except with if/else statements to retrieve the date from all HTML files. Unfortunately, my code does not work the way I want: I do not get the correct date for every press release.

Does anybody know how to iterate over multiple soup elements and pick the right one to extract the date from each HTML file?

Here is my code:

from pandas.core.internals.managers import ensure_block_shape
import bs4, requests

pr_list = []

def parseContent(Urls):
  for x in Urls:
   res = requests.get(x)
   article = bs4.BeautifulSoup(res.text, 'html.parser')
   try:
    date = article.select('#main-wrapper > main > div.section > p.ecb-publicationDate')
    if date:
      for x in date:
        date = x.text.strip()   
    date = article.select('#main-wrapper > main > div.ecb-pressContentPubDate')
    if date:
      for x in date:
          date = x.text.strip()     
    else:
      date = article.select('#main-wrapper > main > div.title > ul > li.ecb-publicationDate')
      for x in date:
          date = x.text.strip()
   except:
    date = None
   try:
    title = article.select('#main-wrapper > main > div.title > h1')
    for x in title:
      title = x.text.strip()
   except:
    title = None
   try:
    body = article.select("#main-wrapper > main > div.section")
    for x in body:
      body = x.text.strip()
   except:
    body = None
   row = [date,title,body]
   pr_list.append(row)

Answer 1

Score: 1

Store your match expressions in a list and then iterate over them until one is successful:

import bs4
import requests


date_expressions = [
    "#main-wrapper > main > div.section > p.ecb-publicationDate",
    "#main-wrapper > main > div.ecb-pressContentPubDate",
    "#main-wrapper > main > div.title > ul > li.ecb-publicationDate",
]

title_expressions = [
    "#main-wrapper > main > div.title > h1",
]

body_expressions = [
    "#main-wrapper > main > div.section",
]


def try_several_expressions(article, expressions):
    """Try to match an element using the given list of expressions.

    Raise ValueError if we failed to find any matches or if we find
    multiple matches.
    """

    for expr in expressions:
        res = article.select(expr)
        if res:
            break
    else:
        raise ValueError("failed to match any expressions")

    if len(res) > 1:
        raise ValueError("failed to match a unique value")

    return res[0]


def parseContent(urls):
    pr_list = []
    for url in urls:
        res = requests.get(url)
        article = bs4.BeautifulSoup(res.text, "html.parser")
        date = try_several_expressions(article, date_expressions).text
        title = try_several_expressions(article, title_expressions).text
        body = try_several_expressions(article, body_expressions).text

        row = [date, title, body]
        pr_list.append(row)

    return pr_list

Assuming that you mean "ECB" rather than "EZB", I tested this against <https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230710~77cf718c59.en.html> and it seems to work as expected.
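
The fallback behaviour can also be checked offline, without fetching anything. The following is only a minimal sketch: the three HTML strings are hypothetical, heavily simplified stand-ins for the three page layouts (they are not real ECB markup), and the dates in them are made up for illustration. The function is repeated so that the snippet runs on its own.

```python
import bs4

date_expressions = [
    "#main-wrapper > main > div.section > p.ecb-publicationDate",
    "#main-wrapper > main > div.ecb-pressContentPubDate",
    "#main-wrapper > main > div.title > ul > li.ecb-publicationDate",
]


def try_several_expressions(article, expressions):
    """Return the single element matched by the first expression that matches."""
    for expr in expressions:
        res = article.select(expr)
        if res:
            break
    else:
        # The for loop finished without a break: nothing matched.
        raise ValueError("failed to match any expressions")

    if len(res) > 1:
        raise ValueError("failed to match a unique value")

    return res[0]


# Hypothetical, simplified layouts (assumptions for illustration only).
layouts = [
    '<div id="main-wrapper"><main><div class="section">'
    '<p class="ecb-publicationDate">10 July 2023</p></div></main></div>',
    '<div id="main-wrapper"><main>'
    '<div class="ecb-pressContentPubDate">24 February 2020</div></main></div>',
    '<div id="main-wrapper"><main><div class="title"><ul>'
    '<li class="ecb-publicationDate">12 September 2012</li></ul></div></main></div>',
]

dates = []
for html in layouts:
    article = bs4.BeautifulSoup(html, "html.parser")
    dates.append(try_several_expressions(article, date_expressions).text.strip())

print(dates)  # ['10 July 2023', '24 February 2020', '12 September 2012']
```

Each layout is matched by a different entry of date_expressions, so the order of the list acts as a priority order.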


If I make the one change I suggested in my comment (remove the if len(res) > 1 check), so that try_several_expressions looks like this:

def try_several_expressions(article, expressions):
    """Try to match an element using the given list of expressions.

    Raise ValueError if we failed to find any matches. If an expression
    matches multiple elements, return the first one.
    """

    for expr in expressions:
        res = article.select(expr)
        if res:
            break
    else:
        raise ValueError("failed to match any expressions")

    # Always return the first matched element
    return res[0]

Then the script works for every single url in your list except for <https://www.ecb.europa.eu/press/pr/date/2020/html/ecb.pr2002242~8842dcb418.en.html>, which doesn't have any content.

If you put a try/except block in parseContent, you can simply ignore that failure:

def parseContent(urls):
    pr_list = []
    for url in urls:
        res = requests.get(url)
        article = bs4.BeautifulSoup(res.text, "html.parser")
        try:
            date = try_several_expressions(article, date_expressions).text.strip()
            title = try_several_expressions(article, title_expressions).text.strip()
            body = try_several_expressions(article, body_expressions).text
        except ValueError:
            print(f'failed to parse: {url}')
            continue

        row = [date, title, body]
        pr_list.append(row)

    return pr_list
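
If you would rather keep a row with missing fields than skip the whole URL, one possible variant returns a default value instead of raising. This is only a sketch under that assumption: first_text is a hypothetical helper name, not part of the code above, and the two HTML strings are made-up minimal pages. It uses select_one, which returns the first matching element or None.

```python
import bs4

title_expressions = [
    "#main-wrapper > main > div.title > h1",
]


def first_text(article, expressions, default=None):
    """Return the stripped text of the first element matched by any
    expression, or `default` if no expression matches."""
    for expr in expressions:
        element = article.select_one(expr)
        if element is not None:
            return element.get_text(strip=True)
    return default


# Hypothetical minimal pages (assumptions for illustration only).
full_page = bs4.BeautifulSoup(
    '<div id="main-wrapper"><main><div class="title">'
    "<h1> Press release </h1></div></main></div>",
    "html.parser",
)
empty_page = bs4.BeautifulSoup(
    '<div id="main-wrapper"><main></main></div>', "html.parser"
)

print(first_text(full_page, title_expressions))   # Press release
print(first_text(empty_page, title_expressions))  # None
```

With this variant a page without content yields [None, None, None] instead of being dropped, so whether it fits depends on how you want to treat incomplete rows later.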

Answer 2

Score: 1

<details>
<summary>English:</summary>

I improved your code as follows:
- Removed the unnecessary try/except blocks
- Replaced the complex logic and selectors with static selectors and regex-based dynamic selectors

from bs4 import BeautifulSoup
from pprint import pprint
import re
import requests

pr_list = []

urls = [
    'https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230710~77cf718c59.en.html',
    'https://www.ecb.europa.eu/press/pr/date/2012/html/pr120912_1.en.html'
]

def parse_content(urls):
    # raw string, so that \d is passed to the regex engine unchanged
    date_pattern = re.compile(r'\d+ (January|February|March|April|May|June|July|August|September|October|November|December) \d{4}')
    for url in urls:
        print(url)
        res = requests.get(url)
        page = BeautifulSoup(res.text, 'html.parser')

        # initializing default values
        row = [None, None, None]
        main = page.find('main')
        if main is not None:
            # for dates: an element whose class contains "Date" and whose text matches the pattern
            date_element = main.find(attrs={'class': re.compile('Date')}, string=date_pattern)
            if date_element:
                row[0] = date_element.text.strip()
            # getting body
            section = main.find('div', {'class': 'section'})
            if section:
                row[2] = section.text.strip()
        # getting title
        title_div = page.find('div', {'class': 'title'})
        if title_div and title_div.find('h1'):
            row[1] = title_div.find('h1').text.strip()
        pr_list.append(row)

parse_content(urls)
pprint(pr_list)


Note that I used a regex to find the dates, since the dates in the examples you provided follow this pattern and sit in elements inside the `main` tag whose class names contain `Date`.

Output:

https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230710~77cf718c59.en.html
https://www.ecb.europa.eu/press/pr/date/2012/html/pr120912_1.en.html
[['10 July 2023',
'ECB surveys Europeans on new themes for euro banknotes',
'10 July 2023Europeans invited to express preferences on shortlisted themes '
'in public survey open until 31\xa0August 2023ECB’s Governing Council '
'expected to choose future theme by 2024, and final designs in 2026The '
'European Central Bank (ECB) is asking European citizens about their views '
'on the proposed themes for the next series of euro banknotes. From 10 July '
'until 31 August 2023 everybody in the euro area can respond to a survey on '
'the ECB’s website. In addition, to ensure opinions from across the euro '
'area are equally represented, the ECB has contracted an independent '
'research company to ask a representative sample of people in the euro area '
'the same questions as those in its own survey.ECB President Christine '
'Lagarde invites everybody to participate in the survey. She said “There is '
'a strong link between our single currency and our shared European identity, '
'and our new series of banknotes should emphasise this. We want Europeans to '
'identify with the design of euro banknotes, which is why they will play an '
'active role in selecting the new theme.”Developing our future euro '
'banknotes“We are working on a new series of high-tech banknotes with a view '
'to preventing counterfeiting and reducing environmental impact,” said '
'Executive Board member Fabio Panetta. “We are committed to cash and to '
'ensuring that paying with public money is always an option.”It is the duty '
'of the ECB and the euro area national central banks to ensure euro '
'banknotes remain an innovative, secure and efficient means of payment. '
'Developing new series of banknotes is a standard practice for all central '
'banks. In a world where reproduction technologies are rapidly evolving and '
'where counterfeiters can easily access information and materials, it is '
'necessary to issue new banknotes on a regular basis. Beyond security '
'considerations, the ECB is committed to reducing the environmental impact '
'of euro banknotes throughout their life cycle, while also making them more '
'relatable and inclusive for Europeans of all ages and backgrounds, '
'including vulnerable groups such as people with visual '
'impairment.Shortlisted themes for future banknotesThe seven themes '
'shortlisted by the ECB’s Governing Council are listed below.[1]Birds: free, '
'resilient, inspiringBirds know nothing of national borders and symbolise '
'freedom of movement. Their nests remind us of our own desire to build '
'places and societies that nurture and protect the future. They remind us '
'that we share our continent with all the lifeforms that sustain our common '
'existence.European cultureEurope’s rich cultural heritage and dynamic '
'cultural and creative sectors strengthen the European identity, forging a '
'shared sense of belonging. Culture promotes common values, inclusion and '
'dialogue in Europe and across the globe. It brings people together.European '
'values mirrored in natureEurope is a living place, but also an idea. The '
'European Union is an organisation, but also a set of values. The theme '
'highlights the role of European values (human dignity, freedom, democracy, '
'equality, the rule of law and human rights) as the building blocks of '
'Europe and links these values to our respect for nature and the '
'preservation of the environment.The future is yoursThe ideas and '
'innovations that will shape the future of Europe lie deep within every '
'European. The images created for this theme represent the bearers of the '
'collective imagination through which people will create this shared future. '
'This theme signifies the boundless potential of Europeans.Hands: together '
'we build EuropeHands are familiar to all of us but no two pairs are the '
'same. Hands built Europe, its physical infrastructure, its artistic '
'heritage and its achievements. Hands build, weave, heal, teach, connect and '
'guide us. Hands tell stories of labour, age and relationships, of heritage, '
'history, and culture. This theme celebrates the hands that have built '
'Europe and continue to do so every day.\xa0Our Europe, ourselvesWe grow up '
'as individuals but also as part of a community, through our relationships '
'with one another. We have our own stories and identities, but we also share '
'a common identity as Europeans. This theme evokes the freedom, values and '
"openness of people in Europe.Rivers: the waters of life in EuropeEurope's "
'rivers cross borders. They connect us to each other and to nature. They '
'represent the ebb and flow of a dynamic, ever-changing continent. They '
'nurture us and remind us of the deep sources of our common life, and we '
'must nurture them in turn.The shortlist of themes takes into account the '
'suggestions made by a multidisciplinary advisory group, with members from '
'all euro area countries.Timeline for the new designsThe outcome of the '
'surveys will be used by the ECB to select the theme for the next generation '
'of banknotes by 2024. After that a design competition will take place. '
'European citizens will again have the chance to express their preferences '
'on the design options resulting from that competition. The ECB is expected '
'to take the decision on the future design, and on when to produce and issue '
'the new banknotes, in 2026.For media queries, please contact Belén Pérez '
'Esteve, tel.: +49 173 533 4269.'],
['12 September 2012',
'ECB extends the swap facility agreement \u2028with the Bank of England',
'The Governing Council of the European Central Bank (ECB) has decided, in '
'agreement with the Bank of England, to extend the liquidity swap '
'arrangement with the Bank of England up to \u2028'
'30 September 2013. The swap facility agreement established on 17 December '
'2010 had been authorised until the end of September 2011 and then extended '
'until 28 September 2012.\n'
'The related announcement by the Bank of England is available at their '
'website http://www.bankofengland.co.uk.']]
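
The output above shows paragraphs running together ("2023Europeans") as well as non-breaking spaces (\xa0) and line separators (\u2028) inside the text. If that matters for your analysis, one possible cleanup is sketched below, together with converting the date string into a real date. The HTML fragment is a made-up stand-in, not real ECB markup, and the date format is an assumption based on the two examples; note that %B depends on the locale, so this assumes English month names.

```python
import re
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical fragment imitating the quirks seen in the output above.
html = (
    '<main><div class="section">'
    '<p class="ecb-publicationDate">10 July 2023</p>'
    "<p>Survey open until 31\xa0August 2023</p>"
    "<p>ECB extends the swap \u2028facility</p>"
    "</div></main>"
)
page = BeautifulSoup(html, "html.parser")
section = page.find("main").find("div", {"class": "section"})

# .text glues adjacent paragraphs together without any separator.
raw = section.text

# get_text(" ") puts a space between text nodes; split()/join() then
# collapses every run of whitespace (including \xa0 and \u2028) to one space.
clean = " ".join(section.get_text(" ").split())

# Turn the date string into a date object, assuming the "10 July 2023" format.
date_element = page.find(attrs={"class": re.compile("Date")})
parsed = datetime.strptime(date_element.text.strip(), "%d %B %Y").date()

print(clean)
print(parsed.isoformat())  # 2023-07-10
```

Storing the parsed date rather than the raw string makes it easier to sort or filter the press releases afterwards.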


</details>

huangapple
  • Posted on July 18, 2023, 03:38:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76707619.html