February 27, 2023, 08:21:53

BeautifulSoup -- extracting both "td" objects without a class (class_=None or False) and other class types

Question

I am trying to scrape a website that has td objects. Some of them have no class, and I can extract those with

object.find_all("td", class_=None)

And others have a class called sem_dados, which I can extract using

object.find_all("td", class_="sem_dados")

The main issue is that I can't do both at the same time. For instance,

object.find_all("td", class_=[None, "sem_dados"])

will not return the td objects that have no class. This seems to be a problem with how None or False behaves inside a list, since

object.find_all("td", class_=[None])

will also return an empty list.

Does anyone know how to change the syntax so I can get both in a single call? The order of the extracted elements matters. I could reorder them manually, but I believe there must be a syntax that does what I am trying to do.

I tried many different syntaxes but couldn't get anything working.

Answer 1

Score: 0

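Maybe you can use a custom lambda function. As far as I know, find_all calls it with each class value of a tag, or with None when the tag has no class attribute, so `not c` covers the td objects without a class: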

from bs4 import BeautifulSoup

html_doc = '''\
<td class="sem_dados">I want this 1</td>
<td class="other">I don't want this</td>
<td>I want this 2</td>'''

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find_all('td', class_=lambda c: not c or 'sem_dados' == c))

Prints:

[<td class="sem_dados">I want this 1</td>, <td>I want this 2</td>]
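
If the lambda feels too terse, a tag-level filter function or a CSS selector list might also work. The following is only a sketch under a few assumptions: the helper name `wanted_td` and the small `<table>` wrapper are made up for illustration, and `select` relies on the soupsieve package that normally ships alongside recent Beautiful Soup versions. Both calls walk the tree in document order, so the ordering from the page should be preserved.

```python
from bs4 import BeautifulSoup

# Illustrative markup only; the real page will look different.
html_doc = """\
<table><tr>
<td class="sem_dados">I want this 1</td>
<td class="other">I don't want this</td>
<td>I want this 2</td>
</tr></table>"""

soup = BeautifulSoup(html_doc, "html.parser")


def wanted_td(tag):
    """Hypothetical helper: True for <td> tags with no class attribute
    or with sem_dados among their classes."""
    if tag.name != "td":
        return False
    classes = tag.get("class")  # list of class names, or None if absent
    return classes is None or "sem_dados" in classes


# find_all accepts a function that receives each Tag and returns True/False.
by_function = soup.find_all(wanted_td)

# The same filter expressed as a CSS selector list.
by_selector = soup.select("td:not([class]), td.sem_dados")

texts = [td.get_text() for td in by_function]
print(texts)  # ['I want this 1', 'I want this 2']
```

The function version makes the "no class attribute" case explicit (`classes is None`), whereas `not c` in the lambda would also accept an empty class value.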
