January 9, 2023, 07:58:28

How to iterate HTML file and parse specific data to Dataframe?

Question

I have looked over various methods from BeautifulSoup to XML parsers and I think that there must be a simpler way to iterate over an HTML file to parse information into a dataframe table. There is a lot of information with specific section headers:

<h2 class="chapter-header-western">CHAPTER 1</h2>
	<p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>1</b>text
	<p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>Header</b></p>
	<p align="left" style="line-height: 100%; margin-bottom: 0.08in"><b>2</b>text
	</p>
	<p align="left" style="line-height: 120%; margin-left: 0.3in; text-indent: -0.3in; margin-bottom: 0.08in">
	<b>3 </b>text
	</p>
	<p class="western" style="line-height: 100%; margin-bottom: 0.08in">text <b>4 </b>text
	<b>5 </b>text<b>6 </b>text
	</p>

The HTML is a bit of a mess because it was converted from a docx file, but all I need is to parse each piece of text that follows a bold number into its own row:

Chapter	Number	Text
1	1	text
1	2	text
1	3	text

Perhaps I need to insert a tag for each bold number to act as a delimiter?

I tried using BeautifulSoup find_all but this only returns strings between tags, and I need a way to return the text following a set of tags.

Answer 1

Score: 0

Based on your example, you could select all <b> elements and check whether their text isnumeric(). Use find_previous() to get the chapter from the preceding <h2> and next_sibling to get the text that follows the number:

for e in soup.select('b'):
    if e.get_text(strip=True).isnumeric():
        data.append({
            'chapter': e.find_previous('h2').get_text(strip=True).split()[-1],
            'number': e.get_text(strip=True),
            'text': e.next_sibling.strip() if e.next_sibling else None
        })
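
The question notes that find_all only gives back what sits between the tags. As a minimal sketch (the one-line HTML string here is made up for illustration), this compares the text inside each <b> with the node that next_sibling returns right after it:

```python
from bs4 import BeautifulSoup

# Hypothetical one-paragraph input, shaped like the last <p> in the question.
html = '<p>intro <b>4 </b>first<b>5 </b>second</p>'
soup = BeautifulSoup(html, 'html.parser')

# The text inside each <b> tag is only the number itself.
inside = [b.get_text(strip=True) for b in soup.find_all('b')]

# next_sibling is the node immediately after the closing </b>,
# which in this input is the text that belongs to the number.
after = [str(b.next_sibling).strip() for b in soup.find_all('b')]

print(inside)  # ['4', '5']
print(after)   # ['first', 'second']
```

Note that the leading "intro " is not picked up by either list, because it precedes the first bold number.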

Example

import pandas as pd
from bs4 import BeautifulSoup

html = '''
<h2 class="chapter-header-western">CHAPTER 1</h2>
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>1</b>text
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>Header</b></p>
    <p align="left" style="line-height: 100%; margin-bottom: 0.08in"><b>2</b>text
    </p>
    <p align="left" style="line-height: 120%; margin-left: 0.3in; text-indent: -0.3in; margin-bottom: 0.08in">
    <b>3 </b>text
    </p>
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in">text <b>4 </b>text
    <b>5 </b>text<b>6 </b>text
    </p>
'''

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(html, 'html.parser')

data = []
for e in soup.select('b'):
    # Keep only bold elements whose text is a number; this skips <b>Header</b>.
    if e.get_text(strip=True).isnumeric():
        data.append({
            # Chapter number: last word of the closest preceding <h2>.
            'chapter': e.find_previous('h2').get_text(strip=True).split()[-1],
            'number': e.get_text(strip=True),
            # Text node directly after the closing </b>, if there is one.
            'text': e.next_sibling.strip() if e.next_sibling else None
        })

pd.DataFrame(data)

Output

	chapter	number	text
0	1	1	text
1	1	2	text
2	1	3	text
3	1	4	text
4	1	5	text
5	1	6	text
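
The loop above assumes that the node after each bold number is a plain string. If the real file also contains inline markup such as <i> or <span> after a number, next_sibling would be a tag and .strip() would fail. The following is only a sketch of one way to handle that, not part of the original answer: the helper name rows_from_html and the sample string are hypothetical, and it assumes that a following <b>, <p>, or <h2> marks the end of an entry.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def rows_from_html(html):
    """Return one dict per bold number: chapter, number, and the text after it."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for b in soup.find_all('b'):
        number = b.get_text(strip=True)
        if not number.isnumeric():
            continue  # skip bold headers such as <b>Header</b>
        heading = b.find_previous('h2')
        words = heading.get_text(strip=True).split() if heading else []
        chapter = words[-1] if words else None
        # Collect siblings until the next <b>, <p>, or <h2> (assumed boundaries;
        # the <p> check matters when an unclosed <p> ends up nested).
        parts = []
        for node in b.next_siblings:
            if isinstance(node, Tag):
                if node.name in ('b', 'p', 'h2'):
                    break
                parts.append(node.get_text())  # inline tag: keep its text
            else:
                parts.append(str(node))  # plain text node
        text = ' '.join(''.join(parts).split())  # collapse whitespace
        rows.append({'chapter': chapter, 'number': number, 'text': text or None})
    return rows


# Hypothetical sample with two chapters and an inline <i> tag.
sample = '''
<h2>CHAPTER 1</h2>
<p><b>1</b>plain <i>italic</i> words</p>
<p><b>Header</b></p>
<h2>CHAPTER 2</h2>
<p><b>1 </b>first<b>2 </b>second</p>
'''
rows = rows_from_html(sample)
for row in rows:
    print(row)
```

The returned list can be passed to pd.DataFrame(rows) exactly as in the example above. To run it on a file instead of a string, read the file contents first and pass them to the function.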
