Extract area from HTML parsed by BeautifulSoup
Question
The HTML looks as follows:
<div style="margin-bottom:2px;"><strong>Part1:</strong></div>
<ul>
<li>This is part one.</li>
<li>Please extract me</li>
</ul>
&nbsp;
<div style="margin-bottom:2px;"><strong>PartTwo:</strong></div>
<ul>
<li>This is part 2</li>
<li>This has not to be extracted</li>
</ul>
In fact, I just want to extract Part1, that is, the items "This is part one." and "Please extract me".
I have tackled the problem with soup and a text filter, but I don't think this is the correct approach, as it only extracts "Part1:":
soup = BeautifulSoup(html_document, 'html.parser')
part1 = soup(text=lambda t: "Part1:" in t.text)
part1
Something like the following loop does not work either, as it also includes PartTwo:
for ul in soup:
    for li in soup.findAll('li'):
        print(li)
So in fact I only want to extract the content that belongs to the first strong tag, the one labelled "Part1:".
Answer 1
Score: 1
How about trying this:
```python
from bs4 import BeautifulSoup

html_sample = """<div style="margin-bottom:2px;"><strong>Part1:</strong></div>
<ul>
<li>This is part one.</li>
<li>Please extract me</li>
</ul>
&nbsp;
<div style="margin-bottom:2px;"><strong>PartTwo:</strong></div>
<ul>
<li>This is part 2</li>
<li>This has not to be extracted</li>
</ul>"""

# select_one() returns only the first match: the <ul> that immediately
# follows the first styled <div>, i.e. the Part1 list.
# The "lxml" parser requires the lxml package to be installed.
items = (
    BeautifulSoup(html_sample, "lxml")
    .select_one("div[style='margin-bottom:2px;'] + ul")
    .select("li")
)
print("\n".join(li.get_text() for li in items))
```
Output:
This is part one.
Please extract me
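The selector above relies on the Part1 block being the first styled `div` in the document. If the order of the sections might change, a possible alternative is to anchor on the label text itself, which is what the question asks for. The following is only a sketch under that assumption: `items_under_label` is a hypothetical helper name, not part of BeautifulSoup, and it uses the built-in `html.parser` so that `lxml` is not needed.

```python
from bs4 import BeautifulSoup

html_sample = """<div style="margin-bottom:2px;"><strong>Part1:</strong></div>
<ul>
<li>This is part one.</li>
<li>Please extract me</li>
</ul>
&nbsp;
<div style="margin-bottom:2px;"><strong>PartTwo:</strong></div>
<ul>
<li>This is part 2</li>
<li>This has not to be extracted</li>
</ul>"""


def items_under_label(html, label):
    """Hypothetical helper: text of each <li> in the first <ul> that
    follows the <strong> tag whose text is exactly `label`."""
    soup = BeautifulSoup(html, "html.parser")
    # Find the <strong> element whose string equals the label.
    heading = soup.find("strong", string=label)
    if heading is None:
        return []
    # find_next() walks forward in document order, so it reaches the
    # <ul> that comes right after this heading's <div>.
    ul = heading.find_next("ul")
    if ul is None:
        return []
    return [li.get_text(strip=True) for li in ul.find_all("li")]


part1_items = items_under_label(html_sample, "Part1:")
print("\n".join(part1_items))
```

For the sample HTML this prints the two Part1 items, and calling the helper with "PartTwo:" returns the second list instead. Note that the label must match the tag text exactly, including the trailing colon.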