Python - Scraping text inside <br> which is not under a <p>

Question


I am trying to scrape the content of this website: https://public.era.nih.gov/pubroster/roster.era?CID=102353 and I am able to do it for the names beginning with ANANDASABAPATHY which are contained inside a "p" tag:

from bs4 import BeautifulSoup

driver.get(url)

content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content, "html.parser")

column = soup.find_all("p")

and then playing with the length of the element:

for bullet in column:
    if len(bullet.find_all("br")) == 4:
        # The text nodes sit at the even indices, between the <br> tags.
        person = {}
        person["NAME"] = bullet.contents[0].strip()
        person["PROFESSION"] = bullet.contents[2].strip()
        person["DEPARTMENT"] = bullet.contents[4].strip()
        person["INSTITUTION"] = bullet.contents[6].strip()
        person["LOCATION"] = bullet.contents[8].strip()

However, I have 2 issues.

  1. I am unable to scrape the information for the chairperson (GUDJONSSON), which is not contained inside a "p" tag. I was trying something like:

> soup.find("b").findNext('br').findNext('br').findNext('br').contents[0].strip()

but it is not working (see the probe after this list).

  2. I am unable to differentiate between the last 2 persons (WONDRAK and GERSCH) because they are both contained inside the same "p" tag.
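
A quick probe suggests why the chain in the first issue fails: BeautifulSoup treats a <br> as a void element, so its .contents list is empty and contents[0] raises an IndexError; the visible text lives in sibling nodes next to the <br> tags, not inside them:

b = soup.find("b")
br = b.findNext("br")
print(br.contents)            # [] -- a <br> has no children
print(repr(br.next_sibling))  # the text sits *beside* the <br>, as a sibling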

Any help would be extremely useful! Thanks in advance!

Answer 1

Score: 1


This is a case where it may be easier to process the data as plain text rather than as HTML, after initially extracting the element you're looking for. The reason is that the HTML is not well formatted for parsing: it doesn't follow a very uniform pattern. The html5lib package generally handles poorly formatted HTML better than html.parser, but it didn't help significantly in this case.

import re
from typing import Collection, Iterator

from bs4 import BeautifulSoup


def iter_lines(soup: BeautifulSoup, ignore: Collection[str] = ()) -> Iterator[str]:
    # Walk everything after the first <b> tag and yield one clean string
    # per displayed line, collapsing any internal newlines and whitespace.
    for sibling in soup.find('b').next_siblings:
        for block in sibling.stripped_strings:
            block_str = ' '.join(filter(None, (line.strip() for line in block.split('\n'))))
            if block_str and block_str not in ignore:
                yield block_str


def group_people(soup: BeautifulSoup, ignore: Collection[str] = ()) -> list[list[str]]:
    # A line ending in ", <digits>" (a zip code) marks the end of a person.
    zip_code_pattern = re.compile(r', \d+$')
    people = []
    person = []
    for line in iter_lines(soup, ignore):
        person.append(line)
        if zip_code_pattern.search(line):
            people.append(person)
            person = []

    return people


def normalize_person(raw_person: list[str]) -> dict[str, str | None]:
    return {
        'NAME': raw_person[0],
        'PROFESSION': raw_person[1] if len(raw_person) > 4 else None,
        'DEPARTMENT': next((line for line in raw_person if 'DEPARTMENT' in line), None),
        'INSTITUTION': raw_person[-2],
        'LOCATION': raw_person[-1],
    }


raw_people = group_people(soup, ignore={'SCIENTIFIC REVIEW OFFICER'})
normalized = [normalize_person(person) for person in raw_people]

This works with both BeautifulSoup(content, 'html.parser') and BeautifulSoup(content, 'html5lib').
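
For context, here is a minimal sketch of how the functions above could be wired to the Selenium setup from the question (Chrome is an assumption here; any Selenium WebDriver should work):

from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://public.era.nih.gov/pubroster/roster.era?CID=102353'
driver = webdriver.Chrome()  # assumption: Chrome; swap in your own driver
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

raw_people = group_people(soup, ignore={'SCIENTIFIC REVIEW OFFICER'})
for person in (normalize_person(p) for p in raw_people):
    print(person['NAME'], '|', person['LOCATION'])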

The iter_lines function finds the first <b> tag like you did before, and then yields a single string for each line that is displayed in a browser.

The group_people function groups the lines into separate people, using the zip code at the end to indicate that that person's entry is complete. It may be possible to combine this function with iter_lines and skip the regex, but this was slightly easier. Better formatted html would be more conducive to that approach.
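
To make the grouping rule concrete, here is how the zip-code test behaves on a couple of made-up lines (not real data from the page):

import re

zip_code_pattern = re.compile(r', \d+$')

print(bool(zip_code_pattern.search('ANN ARBOR, MI, 48109')))       # True  -> this line closes a person
print(bool(zip_code_pattern.search('DEPARTMENT OF DERMATOLOGY')))  # False -> keep accumulating lines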

The ignore parameter was used to skip the SCIENTIFIC REVIEW OFFICER header above the last person on that page.

Lastly, the normalize_person function attempts to interpret what each line for a given person means. The name, institution, and location appear to be fairly consistent, but I took some liberties with profession and department to use None when it seemed like a value did not exist. Those decisions were only made based on the particular page you linked to - you may need to adjust those for other pages. It uses negative indexes for the institution and location because the number of lines that existed for each person's data was variable.
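
As a concrete illustration of that interpretation (the person below is entirely made up), a five-line entry keeps its profession, and the negative indexes pick out the institution and location regardless of how many lines precede them:

raw = [
    'DOE, JANE, MD',               # NAME
    'PROFESSOR OF MEDICINE',       # PROFESSION (kept because len > 4)
    'DEPARTMENT OF DERMATOLOGY',   # DEPARTMENT (matched by substring)
    'EXAMPLE UNIVERSITY',          # INSTITUTION (raw_person[-2])
    'ANN ARBOR, MI, 48109',        # LOCATION (raw_person[-1])
]
print(normalize_person(raw))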
