Python - Scraping text inside <br> which is not under a <p>

Question


I am trying to scrape the content of this website: https://public.era.nih.gov/pubroster/roster.era?CID=102353 and I am able to do it for the names beginning with ANANDASABAPATHY which are contained inside a "p" tag:

from bs4 import BeautifulSoup

driver.get(url)

content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content, "html.parser")

column = soup.find_all("p")

and then playing with the length of the element:

for bullet in column:
    if len(bullet.find_all("br")) == 4:
        # The text nodes sit at the even indices, between the <br> tags.
        person = {}
        person["NAME"] = bullet.contents[0].strip()
        person["PROFESSION"] = bullet.contents[2].strip()
        person["DEPARTMENT"] = bullet.contents[4].strip()
        person["INSTITUTION"] = bullet.contents[6].strip()
        person["LOCATION"] = bullet.contents[8].strip()

However, I have 2 issues.

  1. I am unable to scrape the information for the chairperson (GUDJONSSON), which is not contained inside a "p" tag. I was trying something like:

> soup.find("b").findNext('br').findNext('br').findNext('br').contents[0].strip()

but it is not working (see the probe after this list).

  2. I am unable to differentiate between the last 2 persons (WONDRAK and GERSCH) because they are both contained inside the same "p" tag.
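
A quick probe suggests why the chain in the first issue fails: BeautifulSoup treats a <br> as a void element, so its .contents list is empty and contents[0] raises an IndexError; the visible text lives in sibling nodes next to the <br> tags, not inside them:

b = soup.find("b")
br = b.findNext("br")
print(br.contents)            # [] -- a <br> has no children
print(repr(br.next_sibling))  # the text sits *beside* the <br>, as a sibling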

Any help would be extremely useful! Thanks in advance!

Answer 1

Score: 1


This is a case where it may be easier to process the data as plain text rather than as HTML, after initially extracting the element you're looking for. The reason is that the HTML is not well formatted for parsing: it doesn't follow a very uniform pattern. The html5lib package generally handles poorly formatted HTML better than html.parser, but it didn't help significantly in this case.

import re
from typing import Collection, Iterator

from bs4 import BeautifulSoup


def iter_lines(soup: BeautifulSoup, ignore: Collection[str] = ()) -> Iterator[str]:
    # Walk everything after the first <b> tag and yield one clean string
    # per displayed line, collapsing any internal newlines and whitespace.
    for sibling in soup.find('b').next_siblings:
        for block in sibling.stripped_strings:
            block_str = ' '.join(filter(None, (line.strip() for line in block.split('\n'))))
            if block_str and block_str not in ignore:
                yield block_str


def group_people(soup: BeautifulSoup, ignore: Collection[str] = ()) -> list[list[str]]:
    # A line ending in ", <digits>" (a zip code) marks the end of a person.
    zip_code_pattern = re.compile(r', \d+$')
    people = []
    person = []
    for line in iter_lines(soup, ignore):
        person.append(line)
        if zip_code_pattern.search(line):
            people.append(person)
            person = []

    return people


def normalize_person(raw_person: list[str]) -> dict[str, str | None]:
    return {
        'NAME': raw_person[0],
        'PROFESSION': raw_person[1] if len(raw_person) > 4 else None,
        'DEPARTMENT': next((line for line in raw_person if 'DEPARTMENT' in line), None),
        'INSTITUTION': raw_person[-2],
        'LOCATION': raw_person[-1],
    }


raw_people = group_people(soup, ignore={'SCIENTIFIC REVIEW OFFICER'})
normalized = [normalize_person(person) for person in raw_people]

This works with both BeautifulSoup(content, 'html.parser') and BeautifulSoup(content, 'html5lib').
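
For context, here is a minimal sketch of how the functions above could be wired to the Selenium setup from the question (Chrome is an assumption here; any Selenium WebDriver should work):

from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://public.era.nih.gov/pubroster/roster.era?CID=102353'
driver = webdriver.Chrome()  # assumption: Chrome; swap in your own driver
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

raw_people = group_people(soup, ignore={'SCIENTIFIC REVIEW OFFICER'})
for person in (normalize_person(p) for p in raw_people):
    print(person['NAME'], '|', person['LOCATION'])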

The iter_lines function finds the first <b> tag like you did before, and then yields a single string for each line that is displayed in a browser.

The group_people function groups the lines into separate people, using the zip code at the end to indicate that that person's entry is complete. It may be possible to combine this function with iter_lines and skip the regex, but this was slightly easier. Better formatted html would be more conducive to that approach.
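
To make the grouping rule concrete, here is how the zip-code test behaves on a couple of made-up lines (not real data from the page):

import re

zip_code_pattern = re.compile(r', \d+$')

print(bool(zip_code_pattern.search('ANN ARBOR, MI, 48109')))       # True  -> this line closes a person
print(bool(zip_code_pattern.search('DEPARTMENT OF DERMATOLOGY')))  # False -> keep accumulating lines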

The ignore parameter was used to skip the SCIENTIFIC REVIEW OFFICER header above the last person on that page.

Lastly, the normalize_person function attempts to interpret what each line for a given person means. The name, institution, and location appear to be fairly consistent, but I took some liberties with profession and department to use None when it seemed like a value did not exist. Those decisions were only made based on the particular page you linked to - you may need to adjust those for other pages. It uses negative indexes for the institution and location because the number of lines that existed for each person's data was variable.
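
As a concrete illustration of that interpretation (the person below is entirely made up), a five-line entry keeps its profession, and the negative indexes pick out the institution and location regardless of how many lines precede them:

raw = [
    'DOE, JANE, MD',               # NAME
    'PROFESSOR OF MEDICINE',       # PROFESSION (kept because len > 4)
    'DEPARTMENT OF DERMATOLOGY',   # DEPARTMENT (matched by substring)
    'EXAMPLE UNIVERSITY',          # INSTITUTION (raw_person[-2])
    'ANN ARBOR, MI, 48109',        # LOCATION (raw_person[-1])
]
print(normalize_person(raw))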
