How to iterate HTML file and parse specific data to Dataframe?
Question
I have looked over various methods, from BeautifulSoup to XML parsers, and I think there must be a simpler way to iterate over an HTML file and parse its information into a dataframe. The file contains a lot of information under specific section headers:
    <h2 class="chapter-header-western">CHAPTER 1</h2>
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>1</b>text
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>Header</b></p>
    <p align="left" style="line-height: 100%; margin-bottom: 0.08in"><b>2</b>text
    </p>
    <p align="left" style="line-height: 120%; margin-left: 0.3in; text-indent: -0.3in; margin-bottom: 0.08in">
    <b>3 </b>text
    </p>
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in">text <b>4 </b>text
    <b>5 </b>text<b>6 </b>text
    </p>
The HTML is a bit of a mess because it was converted from a docx file, but all I need to do is parse each piece of text that follows a bold number (<b>#</b>) into its own row:
Chapter | Number | Text |
---|---|---|
1 | 1 | text |
1 | 2 | text |
1 | 3 | text |
Perhaps I need to treat the <b>#</b> tag as a delimiter?
I tried using BeautifulSoup's find_all, but it only returns the strings between tags, and I need a way to return the text that follows a set of tags.
Answer 1
Score: 0
Based on your example, you could select all <b> elements and check whether their text isnumeric(). Use find_previous() and next_sibling to pick up the necessary information to the left and right of each match:
    for e in soup.select('b'):
        if e.get_text(strip=True).isnumeric():
            data.append({
                'chapter': e.find_previous('h2').get_text(strip=True).split()[-1],
                'number': e.get_text(strip=True),
                'text': e.next_sibling.strip() if e.next_sibling else None
            })
Example
    from bs4 import BeautifulSoup
    import pandas as pd

    html = '''
    <h2 class="chapter-header-western">CHAPTER 1</h2>
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>1</b>text
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>Header</b></p>
    <p align="left" style="line-height: 100%; margin-bottom: 0.08in"><b>2</b>text
    </p>
    <p align="left" style="line-height: 120%; margin-left: 0.3in; text-indent: -0.3in; margin-bottom: 0.08in">
    <b>3 </b>text
    </p>
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in">text <b>4 </b>text
    <b>5 </b>text<b>6 </b>text
    </p>
    '''

    # Name the parser explicitly to avoid the "no parser specified" warning.
    soup = BeautifulSoup(html, 'html.parser')

    data = []
    for e in soup.select('b'):
        # Keep only bold tags whose text is a number; this skips <b>Header</b>.
        if e.get_text(strip=True).isnumeric():
            data.append({
                # The nearest preceding <h2> holds "CHAPTER n"; take the last word.
                'chapter': e.find_previous('h2').get_text(strip=True).split()[-1],
                'number': e.get_text(strip=True),
                # The text node directly after the <b> tag, if there is one.
                'text': e.next_sibling.strip() if e.next_sibling else None
            })

    pd.DataFrame(data)
Output
| | chapter | number | text |
|---|---|---|---|
| 0 | 1 | 1 | text |
| 1 | 1 | 2 | text |
| 2 | 1 | 3 | text |
| 3 | 1 | 4 | text |
| 4 | 1 | 5 | text |
| 5 | 1 | 6 | text |
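The loop above assumes that every numeric <b> is directly followed by a text node and preceded by an <h2>. If the real file is less regular, a slightly more defensive variant may help. The following is only a sketch under assumptions that are not in the original question: the helper name `rows_from_html` and the shortened `sample` markup are made up for illustration. It gathers the consecutive text nodes after each bold number, stops at the next tag, and tolerates a missing chapter heading.

```python
from bs4 import BeautifulSoup, NavigableString


def rows_from_html(html):
    """Return one dict per bold number (hypothetical helper, not from the original answer)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for b in soup.find_all("b"):
        number = b.get_text(strip=True)
        if not number.isdigit():
            continue  # skip bold headers such as <b>Header</b>
        heading = b.find_previous("h2")
        # Fall back to None when no <h2> precedes the number.
        chapter = heading.get_text(strip=True).split()[-1] if heading else None
        # Collect the text nodes that follow the <b>, stopping at the next tag.
        parts = []
        for sibling in b.next_siblings:
            if not isinstance(sibling, NavigableString):
                break
            parts.append(str(sibling))
        rows.append({"chapter": chapter, "number": number, "text": "".join(parts).strip()})
    return rows


# Simplified, made-up markup in the same shape as the question's HTML.
sample = """
<h2>CHAPTER 1</h2>
<p><b>1</b>first</p>
<p><b>Header</b></p>
<p>lead <b>2 </b>second <b>3 </b>third</p>
<h2>CHAPTER 2</h2>
<p><b>1</b></p>
"""

rows = rows_from_html(sample)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to pd.DataFrame in the same way as `data` in the example. A bold number with nothing after it yields an empty string here rather than raising an error.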