March 8, 2023, 17:17:14

Extract area from HTML parsed by BeautifulSoup

Question


The HTML looks as follows:

```html
<div style="margin-bottom:2px;"><strong>Part1:</strong></div>
<ul>
<li>This is part one.</li>
<li>Please extract me</li>
</ul>
&nbsp;
<div style="margin-bottom:2px;"><strong>PartTwo:</strong></div>
<ul>
<li>This is part 2</li>
<li>This has not to be extracted</li>

</ul>
```

In fact, I just want to extract Part1, that is, "This is part one." and "Please extract me".
I have tackled the problem with soup and text, but I don't think this is the correct approach, as it only extracts "Part1:":

```python
soup = BeautifulSoup(html_document, 'html.parser')

part1 = soup(text=lambda t: "Part1:" in t.text)
part1
```

Something like the following nested loop does not work either, as it also includes PartTwo:

```python
for ul in soup:
    for li in soup.findAll('li'):
        print(li)
```

So what I actually want is only the list that belongs to the first strong tag, the one with the text "Part1:".

Answer 1

Score: 1


How about trying this:

```python
from bs4 import BeautifulSoup

html_sample = """<div style="margin-bottom:2px;"><strong>Part1:</strong></div>
<ul>
<li>This is part one.</li>
<li>Please extract me</li>
</ul>
&nbsp;
<div style="margin-bottom:2px;"><strong>PartTwo:</strong></div>
<ul>
<li>This is part 2</li>
<li>This has not to be extracted</li>

</ul>"""

# Take the <ul> that directly follows the first <div> with this style,
# then collect the <li> elements inside it.
# Note: the "lxml" parser requires the lxml package to be installed.
items = (
    BeautifulSoup(html_sample, "lxml")
    .select_one("div[style='margin-bottom:2px;'] + ul")
    .select("li")
)
print("\n".join(li.get_text() for li in items))
```

Output:

```
This is part one.
Please extract me
```
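
Since the question asks for the section under the strong tag with the text "Part1:", a variation is to locate that heading by its text instead of by the `style` attribute, which both sections share. The following is only a minimal sketch under a few assumptions: the helper name `extract_section_items` is made up for illustration, the heading text is assumed to match exactly, and the wanted list is assumed to be the first `<ul>` after the heading.

```python
from bs4 import BeautifulSoup

html_sample = """<div style="margin-bottom:2px;"><strong>Part1:</strong></div>
<ul>
<li>This is part one.</li>
<li>Please extract me</li>
</ul>
&nbsp;
<div style="margin-bottom:2px;"><strong>PartTwo:</strong></div>
<ul>
<li>This is part 2</li>
<li>This has not to be extracted</li>

</ul>"""


def extract_section_items(html, heading_text):
    """Hypothetical helper: return the <li> texts of the first <ul>
    that comes after the <strong> tag whose text equals heading_text."""
    soup = BeautifulSoup(html, "html.parser")
    # Find the <strong> tag whose only content is exactly heading_text.
    heading = soup.find("strong", string=heading_text)
    if heading is None:
        return []
    # find_next walks forward in document order, so this is the first
    # <ul> appearing after the heading.
    ul = heading.find_next("ul")
    if ul is None:
        return []
    return [li.get_text(strip=True) for li in ul.find_all("li")]


part1_items = extract_section_items(html_sample, "Part1:")
print("\n".join(part1_items))
```

Passing `"PartTwo:"` instead would return the items of the second list, and a heading that is not present yields an empty list rather than an exception.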
