Python - Scraping text inside <br> which is not under a <p>
Question
I am trying to scrape the content of this website: https://public.era.nih.gov/pubroster/roster.era?CID=102353 and I am able to do it for the names beginning with ANANDASABAPATHY which are contained inside a "p" tag:
    driver.get(url)
    content = driver.page_source.encode('utf-8').strip()
    soup = BeautifulSoup(content, "html.parser")
    column = soup.find_all("p")
and then playing with the length of the element:
    for bullet in column:
        if len(bullet.find_all("br")) == 4:
            person = {}
            person["NAME"] = bullet.contents[0].strip()
            person["PROFESSION"] = bullet.contents[2].strip()
            person["DEPARTMENT"] = bullet.contents[4].strip()
            person["INSTITUTION"] = bullet.contents[6].strip()
            person["LOCATION"] = bullet.contents[8].strip()
However, I have 2 issues.
- I am unable to scrape the information for the chairperson (GUDJONSSON), which is not contained inside a "p" tag. I was trying something like the following, but it is not working:
    soup.find("b").findNext('br').findNext('br').findNext('br').contents[0].strip()
- I am unable to differentiate between the last 2 persons (WONDRAK and GERSCH) because they are both contained inside the same "p" tag.
Any help would be extremely useful! Thanks in advance!
Answer 1
Score: 1
This is a case where it may be easier to process the data as plain text rather than as HTML, after initially extracting the element you're looking for. The reason is that the HTML is not well structured for parsing and doesn't follow a uniform pattern. The [html5lib](https://pypi.org/project/html5lib/) package generally handles poorly formatted HTML better than `html.parser`, but it didn't help significantly in this case.
    import re
    from typing import Collection, Iterator

    from bs4 import BeautifulSoup

    def iter_lines(soup: BeautifulSoup, ignore: Collection[str] = ()) -> Iterator[str]:
        # Walk everything that follows the first <b> tag, whether or not it sits inside a <p>.
        for sibling in soup.find('b').next_siblings:
            for block in sibling.stripped_strings:
                # Collapse a text node that spans several source lines into one displayed line.
                block_str = ' '.join(filter(None, (line.strip() for line in block.split('\n'))))
                if block_str and block_str not in ignore:
                    yield block_str

    def group_people(soup: BeautifulSoup, ignore: Collection[str] = ()) -> list[list[str]]:
        # A line ending in ", <digits>" (the zip code) closes the current person's entry.
        zip_code_pattern = re.compile(r', \d+$')
        people = []
        person = []
        for line in iter_lines(soup, ignore):
            person.append(line)
            if zip_code_pattern.search(line):
                people.append(person)
                person = []
        return people

    def normalize_person(raw_person: list[str]) -> dict[str, str | None]:
        return {
            'NAME': raw_person[0],
            # Treat the second line as the profession only when the entry has more than four lines.
            'PROFESSION': raw_person[1] if len(raw_person) > 4 else None,
            'DEPARTMENT': next((line for line in raw_person if 'DEPARTMENT' in line), None),
            # The last two lines are the institution and the location.
            'INSTITUTION': raw_person[-2],
            'LOCATION': raw_person[-1],
        }

    # `soup` is the BeautifulSoup object built in the question.
    raw_people = group_people(soup, ignore={'SCIENTIFIC REVIEW OFFICER'})
    normalized = [normalize_person(person) for person in raw_people]
This works with both `BeautifulSoup(content, 'html.parser')` and `BeautifulSoup(content, 'html5lib')`.
The `iter_lines` function finds the first `<b>` tag like you did before, and then yields a single string for each line that is displayed in a browser.
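The whitespace handling inside `iter_lines` can be tried on a plain string. This is only a minimal sketch: the helper name `collapse_block` and the sample text are made up for illustration and only imitate a text node that the HTML source wraps across two lines.

```python
def collapse_block(block: str) -> str:
    """Join the non-empty, stripped lines of one text block with single spaces."""
    return ' '.join(filter(None, (line.strip() for line in block.split('\n'))))


# Hypothetical text node: one displayed line that the source wraps across two lines.
sample = 'DEPARTMENT OF\n        DERMATOLOGY'
print(collapse_block(sample))  # DEPARTMENT OF DERMATOLOGY
```

A block that contains only whitespace collapses to an empty string, which is why `iter_lines` checks `if block_str` before yielding.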
The `group_people` function groups the lines into separate people, using the zip code at the end to indicate that that person's entry is complete. It may be possible to combine this function with `iter_lines` and skip the regex, but this was slightly easier. Better formatted HTML would be more conducive to that approach.
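The grouping step can be illustrated without any HTML. The sketch below reuses the same regex on a flat list of lines; the function name `group_lines` and the sample lines are made up for illustration and are not taken from the real page.

```python
import re

# Same pattern as above: a comma, a space, then digits at the very end of the line.
ZIP_CODE_PATTERN = re.compile(r', \d+$')


def group_lines(lines):
    """Split a flat list of lines into groups, closing a group at each zip-code line."""
    groups, current = [], []
    for line in lines:
        current.append(line)
        if ZIP_CODE_PATTERN.search(line):
            groups.append(current)
            current = []
    return groups


# Made-up sample lines for two people with a different number of lines each.
lines = [
    'DOE, JANE, PHD', 'PROFESSOR', 'DEPARTMENT OF BIOLOGY',
    'EXAMPLE UNIVERSITY', 'SPRINGFIELD, IL, 62701',
    'ROE, RICHARD, MD', 'EXAMPLE INSTITUTE', 'BETHESDA, MD, 20892',
]

for group in group_lines(lines):
    print(group)
```

Because a group is only stored when a zip-code line is seen, any trailing lines that never reach such a line are silently dropped. That seems acceptable for this page, but it is worth keeping in mind for other pages.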
The `ignore` parameter was used to skip the `SCIENTIFIC REVIEW OFFICER` header above the last person on that page.
Lastly, the `normalize_person` function attempts to interpret what each line for a given person means. The name, institution, and location appear to be fairly consistent, but I took some liberties with profession and department to use `None` when it seemed like a value did not exist. Those decisions were only made based on the particular page you linked to, so you may need to adjust them for other pages. It uses negative indexes for the institution and location because the number of lines that existed for each person's data was variable.
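To see how the negative indexes and the `None` defaults behave on entries of different lengths, here is a small sketch. It repeats the `normalize_person` heuristics with `typing` aliases so that it also runs on older Python versions, and the two sample records are made up rather than taken from the real page.

```python
from typing import Dict, List, Optional


def normalize_person(raw_person: List[str]) -> Dict[str, Optional[str]]:
    """Map one person's lines to named fields, mirroring the heuristics above."""
    return {
        'NAME': raw_person[0],
        # Only trust the second line as a profession when there are more than four lines.
        'PROFESSION': raw_person[1] if len(raw_person) > 4 else None,
        'DEPARTMENT': next((line for line in raw_person if 'DEPARTMENT' in line), None),
        # Negative indexes keep working however many lines precede them.
        'INSTITUTION': raw_person[-2],
        'LOCATION': raw_person[-1],
    }


# Made-up records of different lengths.
full = ['DOE, JANE, PHD', 'PROFESSOR', 'DEPARTMENT OF BIOLOGY',
        'EXAMPLE UNIVERSITY', 'SPRINGFIELD, IL, 62701']
short = ['ROE, RICHARD, MD', 'EXAMPLE INSTITUTE', 'BETHESDA, MD, 20892']

for record in (full, short):
    print(normalize_person(record))
```

Note that a record with exactly four lines gets `PROFESSION` set to `None` whatever its second line contains, and a record with fewer than two lines would raise `IndexError` at `raw_person[-2]`, so other pages may need extra checks.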