2023-06-08 04:28:34

BeautifulSoup's find_all with a list of names does not find targets after another target

Question

If I execute this:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<html><body><p>111</p><p>before<ul><li>222</li></ul>after</p></body></html>", "lxml")
soup.find_all(["p", "li"])

I get this result:

[<p>111</p>, <p>before</p>, <li>222</li>]

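I expected to find "after" in the result as well, either as part of the second "p" element or as a fourth item in the list.

Is this expected behaviour? Is there a way to retrieve the text "after"?

Stranger still, if I run print(soup.prettify()), this is the result: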

<html>
 <body>
  <p>
   111
  </p>
  <p>
   before
  </p>
  <ul>
   <li>
    222
   </li>
  </ul>
  after
 </body>
</html>

The "ul" and "after" are no longer part of the second "p". I assume the source is not valid HTML (?), but again:

Is there a way to deal with this, other than just dropping "after"?


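Answer 1

Score: 2

I suggest using a different parser than lxml in this case: html.parser. lxml applies HTML nesting rules more strictly than html.parser. A "ul" is not allowed inside a "p", so lxml closes the second "p" before the "ul", which leaves "after" as a bare text node directly under "body" (as the prettify output in the question shows). html.parser keeps the nesting as written: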

from bs4 import BeautifulSoup
soup = BeautifulSoup("<html><body><p>111</p><p>before<ul><li>222</li></ul>after</p></body></html>", "html.parser")
print(soup.find_all(["p", "li"]))

Prints:

[<p>111</p>, <p>before<ul><li>222</li></ul>after</p>, <li>222</li>]
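
To check that "after" really is reachable through the second "p" once html.parser is used, a small sketch like the following may help. The helper names texts_by_tag and direct_text, and the constant HTML, are made up for this illustration and are not part of Beautiful Soup; the markup is the one from the question.

```python
from bs4 import BeautifulSoup

# Markup from the question: a <ul> nested inside the second <p>.
HTML = "<html><body><p>111</p><p>before<ul><li>222</li></ul>after</p></body></html>"


def texts_by_tag(markup, parser, names):
    """Return (tag name, full text) pairs for every matching tag (hypothetical helper)."""
    soup = BeautifulSoup(markup, parser)
    return [(tag.name, tag.get_text()) for tag in soup.find_all(names)]


def direct_text(tag):
    """Return only the text nodes that are direct children of tag (hypothetical helper)."""
    return [str(s) for s in tag.find_all(string=True, recursive=False)]


result = texts_by_tag(HTML, "html.parser", ["p", "li"])
print(result)

# The second <p> keeps "before" and "after" as its own text nodes;
# "222" belongs to the nested <li> and is excluded here.
second_p = BeautifulSoup(HTML, "html.parser").find_all("p")[1]
own_text = direct_text(second_p)
print(own_text)
```

With html.parser the second pair should carry "before222after", since get_text() also includes the nested "li" text; direct_text is one way to get "before" and "after" without the "222".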

