How to iterate over an HTML file and parse specific data into a DataFrame?

Question

I have looked at various approaches, from BeautifulSoup to XML parsers, and I think there must be a simpler way to iterate over an HTML file and parse its information into a DataFrame. The file contains a lot of content under specific section headers:

    <h2 class="chapter-header-western">CHAPTER 1</h2>
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>1</b>text
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>Header</b></p>
    <p align="left" style="line-height: 100%; margin-bottom: 0.08in"><b>2</b>text
    </p>
    <p align="left" style="line-height: 120%; margin-left: 0.3in; text-indent: -0.3in; margin-bottom: 0.08in">
    <b>3 </b>text
    </p>
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in">text <b>4 </b>text
    <b>5 </b>text<b>6 </b>text
    </p>

The HTML is a bit of a mess because it was converted from a docx file, but all I need is to put each piece of text that follows a bold number <b>#</b> into its own row:

    Chapter  Number  Text
    1        1       text
    1        2       text
    1        3       text

Perhaps I need to treat the <b>#</b> tag as a delimiter?

I tried BeautifulSoup's find_all, but it only returns the strings between tags, and I need a way to return the text that follows a set of tags.

Answer 1

Score: 0

Based on your example, you could select all <b> elements and check whether their text is numeric with isnumeric(). Then use find_previous() and next_sibling to pick up the chapter heading before the tag and the text after it:

    for e in soup.select('b'):
        # Keep only the bold tags that hold a number
        if e.get_text(strip=True).isnumeric():
            data.append({
                # Nearest <h2> before the tag, e.g. "CHAPTER 1" -> "1"
                'chapter': e.find_previous('h2').get_text(strip=True).split()[-1],
                'number': e.get_text(strip=True),
                # Text node directly after the closing </b>
                'text': e.next_sibling.strip() if e.next_sibling else None
            })
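
To make the two lookups concrete, here is a minimal sketch on a tiny made-up snippet (the markup and the variable names are assumptions for illustration, not taken from your file). It shows that get_text() on the <b> tag returns only what is inside the tag, next_sibling gives the node right after it, and find_previous() walks back to the nearest earlier heading:

```python
from bs4 import BeautifulSoup

# Hypothetical one-paragraph snippet, only for illustration
snippet = '<h2>CHAPTER 7</h2><p><b>12 </b>some verse text</p>'
soup = BeautifulSoup(snippet, 'html.parser')

b = soup.find('b')
inside = b.get_text(strip=True)              # text inside the <b> tag
after = b.next_sibling                       # node immediately after </b>
heading = b.find_previous('h2').get_text()   # nearest <h2> before the tag

print(inside)   # 12
print(after)    # some verse text
print(heading)  # CHAPTER 7
```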

Example

    from bs4 import BeautifulSoup
    import pandas as pd

    html = '''
    <h2 class="chapter-header-western">CHAPTER 1</h2>
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>1</b>text
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>Header</b></p>
    <p align="left" style="line-height: 100%; margin-bottom: 0.08in"><b>2</b>text
    </p>
    <p align="left" style="line-height: 120%; margin-left: 0.3in; text-indent: -0.3in; margin-bottom: 0.08in">
    <b>3 </b>text
    </p>
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in">text <b>4 </b>text
    <b>5 </b>text<b>6 </b>text
    </p>
    '''
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(html, 'html.parser')
    data = []
    for e in soup.select('b'):
        # Skip non-numeric bold tags such as <b>Header</b>
        if e.get_text(strip=True).isnumeric():
            data.append({
                'chapter': e.find_previous('h2').get_text(strip=True).split()[-1],
                'number': e.get_text(strip=True),
                'text': e.next_sibling.strip() if e.next_sibling else None
            })
    df = pd.DataFrame(data)
    print(df)

Output

      chapter number  text
    0       1      1  text
    1       1      2  text
    2       1      3  text
    3       1      4  text
    4       1      5  text
    5       1      6  text
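
One caveat: next_sibling returns only the single node right after the <b> tag, so text that is split across inline tags would be cut short, and calling .strip() fails if that node is a tag rather than a string. A possible variant is sketched below, assuming that each entry runs until the next <b> tag, the next <p> tag, or the end of its paragraph. The function name extract_rows and the sample string are hypothetical:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def extract_rows(html):
    """Return one dict per numeric <b> tag: chapter, number, text."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for b in soup.find_all('b'):
        number = b.get_text(strip=True)
        if not number.isnumeric():
            continue  # skip bold headers such as <b>Header</b>
        parts = []
        for node in b.next_siblings:
            if isinstance(node, Tag) and node.name in ('b', 'p'):
                break  # the next entry or the next paragraph starts here
            # Strings are taken as-is; other inline tags contribute their text
            parts.append(node.get_text() if isinstance(node, Tag) else str(node))
        h2 = b.find_previous('h2')
        rows.append({
            'chapter': h2.get_text(strip=True).split()[-1] if h2 else None,
            'number': number,
            # Collapse runs of whitespace and newlines into single spaces
            'text': ' '.join(''.join(parts).split()),
        })
    return rows


# Made-up sample with an inline <i> tag inside the first entry
sample = '<h2>CHAPTER 1</h2><p><b>1</b>first <i>italic</i> part<b>2 </b>second</p>'
print(extract_rows(sample))
```

Passing the result to pd.DataFrame(extract_rows(html)) should give the same three columns as above.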

huangapple
  • Posted on January 9, 2023, 07:58:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75052150.html