Python BeautifulSoup find_all() method returns unnecessary elements
Question
I'm having trouble scraping elements with the find_all() method.
I am looking for <li class='list-row'>.....</li> tags, but the result also includes <li class='list-row reach-list'> tags, which carry an additional class.
I also tried the select() method.
Here's the Python code:
from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

# Parse the file contents that were just read
soup = BeautifulSoup(contents, "html.parser")
main_block = soup.find('ul', class_='list')
for li in main_block.find_all('li', class_='list-row'):
    print(li.prettify())
Here's the HTML file:
index.html
<ul class="list">
<li class="list-row">
<h2>
<a href="/praca/emis/O4533184" id="offer4533184">
<span class="title">
Senior Developer (HTML, React, VUE.js, C#, SQL)
</span>
</a>
</h2>
</li>
<li class="list-row reach-list">
<ul class="list">
<span class="employer">
IT lions consulting a.s.
</span>
</li>
</ul>
</li>
</ul>
Answer 1
Score: 0
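find_all('li', class_='list-row') matches every <li> whose class attribute contains list-row, so <li class='list-row reach-list'> is returned as well.
You can instead specify that you only want <li> tags that contain an <h2> element, for example: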
from bs4 import BeautifulSoup
html_doc = '''\
<ul class="list">
<li class="list-row">
<h2>
<a href="/praca/emis/O4533184" id="offer4533184">
<span class="title">
Senior Developer (HTML, React, VUE.js, C#, SQL)
</span>
</a>
</h2>
</li>
<li class="list-row reach-list">
<ul class="list">
<span class="employer">
IT lions consulting a.s.
</span>
</li>
</ul>
</li>
</ul>'''
soup = BeautifulSoup(html_doc, 'html.parser')
for li in soup.select('.list-row:has(h2)'):
    print(li)
Prints:
<li class="list-row">
<h2>
<a href="/praca/emis/O4533184" id="offer4533184">
<span class="title">
Senior Developer (HTML, React, VUE.js, C#, SQL)
</span>
</a>
</h2>
</li>
Alternatively, to select only <li> tags that contain a title, use the selector '.list-row:has(.title)'.
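If you would rather match on the class itself than on the element's children, the following is a minimal sketch of a few options. It assumes a simplified, well-formed stand-in for the markup in the question (the shortened html_doc and the variable names are illustrative, not taken from index.html) and compares the loose class_ match with three stricter alternatives.

```python
from bs4 import BeautifulSoup

# Simplified, well-formed stand-in for the question's markup (an assumption).
html_doc = """
<ul class="list">
  <li class="list-row"><h2><span class="title">Senior Developer</span></h2></li>
  <li class="list-row reach-list"><span class="employer">IT lions consulting a.s.</span></li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# class_ matches any single class in the attribute, so both <li> tags are returned.
loose = soup.find_all("li", class_="list-row")

# Attribute selector: the whole class attribute must equal "list-row".
exact_css = soup.select('li[class="list-row"]')

# Keep .list-row items but explicitly exclude the unwanted class.
negated = soup.select("li.list-row:not(.reach-list)")

# Compare the parsed class list directly in a filter function.
exact_fn = soup.find_all(
    lambda tag: tag.name == "li" and tag.get("class") == ["list-row"]
)

print(len(loose), len(exact_css), len(negated), len(exact_fn))  # prints: 2 1 1 1
```

The attribute selector and the filter function both break if the wanted items ever gain another class, so the :not() form is probably the more forgiving choice when you only know which class to exclude.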