March 7, 2023 05:04:01

Python BeautifulSoup find_all() method returns unnecessary elements

Question

I'm having trouble scraping elements with the find_all() method.

I am looking for <li class='list-row'>.....</li> tags, but the result also includes <li class='list-row reach-list'> tags, which carry an additional class.

I tried the select() method too, with the same result.

Here's the Python code:

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

# Parse the file contents that were just read
soup = BeautifulSoup(contents, 'html.parser')
main_block = soup.find('ul', class_='list')
# Expected only <li class='list-row'>, but <li class='list-row reach-list'> is returned too
for li in main_block.find_all('li', class_='list-row'):
    print(li.prettify())

Here's the HTML file:
index.html

<ul class="list">
 <li class="list-row">
  <h2>
   <a href="/praca/emis/O4533184" id="offer4533184">
    <span class="title">
     Senior Developer (HTML, React, VUE.js, C#, SQL)
    </span>
   </a>
  </h2>
 </li>
 <li class="list-row reach-list">
  <ul class="list">
    <span class="employer">
     IT lions consulting a.s.
    </span>
   </li>
  </ul>
 </li>
</ul>

Answer 1

Score: 0

You can specify that you only want <li> tags that contain an <h2> element (for example):

from bs4 import BeautifulSoup

html_doc = '''\
<ul class="list">
 <li class="list-row">
  <h2>
   <a href="/praca/emis/O4533184" id="offer4533184">
    <span class="title">
     Senior Developer (HTML, React, VUE.js, C#, SQL)
    </span>
   </a>
  </h2>
 </li>
 <li class="list-row reach-list">
  <ul class="list">
    <span class="employer">
     IT lions consulting a.s.
    </span>
   </li>
  </ul>
 </li>
</ul>'''

soup = BeautifulSoup(html_doc, 'html.parser')

# :has(h2) keeps only the .list-row elements that contain an <h2> descendant
for li in soup.select('.list-row:has(h2)'):
    print(li)

Prints:

<li class="list-row">
<h2>
<a href="/praca/emis/O4533184" id="offer4533184">
<span class="title">
     Senior Developer (HTML, React, VUE.js, C#, SQL)
    </span>
</a>
</h2>
</li>

Or, to select only the <li> tags that contain a title, use the selector '.list-row:has(.title)'.
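
The extra tags show up because class is a multi-valued attribute: class_='list-row' matches any tag whose class list includes list-row, so <li class='list-row reach-list'> qualifies as well. If the goal is to match on the class itself rather than on the tag's contents, the sketch below shows three possible ways to do that. It uses a simplified, well-formed version of the markup (an assumption made for illustration, not the original index.html), and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the question's markup (illustrative assumption)
html_doc = """
<ul class="list">
  <li class="list-row"><span class="title">Senior Developer</span></li>
  <li class="list-row reach-list"><span class="employer">IT lions consulting a.s.</span></li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# class_ is compared against each individual class, so both <li> tags match
loose = soup.find_all("li", class_="list-row")

# Option 1: attribute selector that compares the whole class attribute value
exact_css = soup.select('li[class="list-row"]')

# Option 2: function filter that compares the parsed list of classes
exact_func = soup.find_all(
    lambda tag: tag.name == "li" and tag.get("class") == ["list-row"]
)

# Option 3: keep .list-row but exclude the unwanted extra class
without_reach = soup.select("li.list-row:not(.reach-list)")

print(len(loose), len(exact_css), len(exact_func), len(without_reach))
```

Options 1 and 2 require the class attribute to be exactly list-row, so they would also skip rows that gain any other class later. Option 3 only excludes reach-list, which may be the safer choice if the page adds unrelated classes.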
