How to iterate HTML file and parse specific data to Dataframe?

Question

I have looked at various approaches, from BeautifulSoup to XML parsers, and I think there must be a simpler way to iterate over an HTML file and parse its contents into a DataFrame. The file contains a lot of text under specific section headers:

<h2 class="chapter-header-western">CHAPTER 1</h2>
	<p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>1</b>text
	<p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>Header</b></p>
	<p align="left" style="line-height: 100%; margin-bottom: 0.08in"><b>2</b>text
	</p>
	<p align="left" style="line-height: 120%; margin-left: 0.3in; text-indent: -0.3in; margin-bottom: 0.08in">
	<b>3 </b>text
	</p>
	<p class="western" style="line-height: 100%; margin-bottom: 0.08in">text <b>4 </b>text
	<b>5 </b>text<b>6 </b>text
	</p>

The HTML is a bit of a mess because it was converted from a docx file, but all I need is to put each piece of text that follows a bold number (<b>#</b>) into its own row:

Chapter  Number  Text
1        1       text
1        2       text
1        3       text

Perhaps I need to treat the <b>#</b> tags as delimiters?

I tried BeautifulSoup's find_all, but it only returns the strings between tags, and I need a way to return the text that follows a tag.

Answer 1

Score: 0

Based on your example, you could select all <b> elements and check whether their text isnumeric(). Then use find_previous() to get the chapter from the preceding <h2> and next_sibling to get the text that follows the <b>:

for e in soup.select('b'):
    if e.get_text(strip=True).isnumeric():
        data.append({
            'chapter': e.find_previous('h2').get_text(strip=True).split()[-1],
            'number': e.get_text(strip=True),
            'text': e.next_sibling.strip() if e.next_sibling else None
        })
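To see why this addresses the find_all problem from the question, here is a minimal sketch on a made-up one-line fragment (the fragment and the variable names are illustrative assumptions, not taken from the original file). It contrasts the text inside a <b> tag with the text node that follows it:

```python
from bs4 import BeautifulSoup

# Tiny illustrative fragment; not the original document
soup = BeautifulSoup(
    '<h2>CHAPTER 1</h2><p><b>1</b>first <b>2 </b>second</p>',
    'html.parser',
)

b = soup.find('b')                      # first <b> element
inside = b.get_text(strip=True)         # text between <b> and </b>
after = b.next_sibling                  # text node right after </b>
chapter = b.find_previous('h2').get_text(strip=True).split()[-1]

print(repr(inside), repr(str(after)), repr(chapter))
```

find_all('b') on its own only gives access to `inside`; `next_sibling` is what reaches the text outside the tag.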

Example

from bs4 import BeautifulSoup
import pandas as pd

html = '''
<h2 class="chapter-header-western">CHAPTER 1</h2>
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>1</b>text
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>Header</b></p>
    <p align="left" style="line-height: 100%; margin-bottom: 0.08in"><b>2</b>text
    </p>
    <p align="left" style="line-height: 120%; margin-left: 0.3in; text-indent: -0.3in; margin-bottom: 0.08in">
    <b>3 </b>text
    </p>
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in">text <b>4 </b>text
    <b>5 </b>text<b>6 </b>text
    </p>
'''
# Name the parser explicitly so bs4 does not warn about guessing one
soup = BeautifulSoup(html, 'html.parser')

data = []

for e in soup.select('b'):
    # Keep only bold tags that hold a number (this skips <b>Header</b>)
    if e.get_text(strip=True).isnumeric():
        data.append({
            # Chapter number is the last word of the closest preceding <h2>
            'chapter': e.find_previous('h2').get_text(strip=True).split()[-1],
            'number': e.get_text(strip=True),
            # Text node directly after </b>, if there is one
            'text': e.next_sibling.strip() if e.next_sibling else None
        })

pd.DataFrame(data)

Output

   chapter  number  text
0  1        1       text
1  1        2       text
2  1        3       text
3  1        4       text
4  1        5       text
5  1        6       text
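Since the HTML comes from a docx conversion, the text after a number may not always be a single plain text node (for instance, it could contain an <i> element). The following is only a sketch of a more defensive variant under that assumption: the helper name `text_after`, the sample HTML, and the choice to stop at the next <b> or <p> are all illustrative, not part of the original answer.

```python
from bs4 import BeautifulSoup, Tag
import pandas as pd

def text_after(b):
    """Collect the text following a <b> until the next <b> or <p> sibling."""
    parts = []
    for node in b.next_siblings:
        # Stop at the next numbered marker or at a nested paragraph
        if isinstance(node, Tag) and node.name in ('b', 'p'):
            break
        # Inline tags contribute their text; text nodes are used as-is
        parts.append(node.get_text() if isinstance(node, Tag) else str(node))
    # Collapse runs of whitespace left over from the conversion
    return ' '.join(''.join(parts).split())

# Hypothetical sample with two chapters and an inline <i> element
html = '''
<h2>CHAPTER 1</h2>
<p><b>1</b>some <i>italic</i> text</p>
<p><b>Header</b></p>
<h2>CHAPTER 2</h2>
<p>lead <b>1 </b>first<b>2 </b>second</p>
'''
soup = BeautifulSoup(html, 'html.parser')

rows = []
for b in soup.select('b'):
    number = b.get_text(strip=True)
    if not number.isnumeric():
        continue  # skip bold headers that are not numbers
    h2 = b.find_previous('h2')
    rows.append({
        'chapter': h2.get_text(strip=True).split()[-1] if h2 else None,
        'number': number,
        'text': text_after(b),
    })

# Convert the numeric columns from strings to integers
df = pd.DataFrame(rows).astype({'chapter': int, 'number': int})
print(df)
```

Because find_previous('h2') always looks backwards from the current <b>, the same loop handles any number of chapters without extra bookkeeping.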

huangapple
  • Posted on January 9, 2023, 07:58:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75052150.html