BeautifulSoup's find_all with a list of names does not find targets after another target
Other questions with similar titles did not answer my question.
If I execute this:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html><body><p>111</p><p>before<ul><li>222</li></ul>after</p></body></html>", "lxml")
soup.find_all(["p", "li"])
I get this result:
[<p>111</p>, <p>before</p>, <li>222</li>]
I expected to find "after" in the result as well, either as part of the second "p" element or as a 4th item in the list.
Is this expected behaviour? Is there a way to retrieve the text "after"?
Stranger still, if I run print(soup.prettify()), this is the result:
<html>
 <body>
  <p>
   111
  </p>
  <p>
   before
  </p>
  <ul>
   <li>
    222
   </li>
  </ul>
  after
 </body>
</html>
The "ul" and "after" are no longer part of the second "p". I assume the source is not valid HTML (?), but again:
Is there a way to deal with this other than just dropping "after"?
Answer 1
Score: 2
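I suggest using a different parser than lxml in this case: html.parser. lxml applies HTML's nesting rules while it parses: a p element cannot contain a ul, so lxml closes the open <p> as soon as it reaches <ul>, which leaves the ul and the text "after" as direct children of body. html.parser does not restructure the markup and keeps the nesting as written: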
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html><body><p>111</p><p>before<ul><li>222</li></ul>after</p></body></html>", "html.parser")
print(soup.find_all(["p", "li"]))
Prints:
[<p>111</p>, <p>before<ul><li>222</li></ul>after</p>, <li>222</li>]
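Once the second p keeps its children, the text "after" can be read directly from it. Here is a minimal sketch, assuming bs4 is installed and the same html.parser setup as above; the variable names (second_p, all_text, direct_text) are mine and only for illustration:

```python
from bs4 import BeautifulSoup

html = "<html><body><p>111</p><p>before<ul><li>222</li></ul>after</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Second <p> in document order; with html.parser it still contains the <ul>.
second_p = soup.find_all("p")[1]

# All text inside the element, including the nested <li>.
all_text = second_p.get_text()

# Only the strings that are direct children of the <p>, skipping the <ul>.
direct_text = [str(s) for s in second_p.find_all(string=True, recursive=False)]

print(all_text)     # before222after
print(direct_text)  # ['before', 'after']
```

get_text() is probably the simplest route if the nested list text is wanted as well; passing recursive=False restricts the search to the element's own children, so the li text is left out.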