BeautifulSoup: extracting both "td" objects without class (class_=None or False) and other class types

Question

I am trying to scrape a website that has td elements. Some of them have no class, which I can extract with

object.find_all("td", class_=None)

Others have a class called sem_dados, which I can extract using

object.find_all("td", class_="sem_dados")

The main issue is that I can't do both at the same time. For instance,

object.find_all("td", class_=[None, "sem_dados"])

will not return the td elements that have no class. This seems to be a problem with how None or False behaves inside a list, since

object.find_all("td", class_=[None])

will also return an empty list.

Does anyone know how to change the syntax so I can match both in one call? The order of extraction matters. I could reorder the results manually, but I believe there must be a syntax for what I am trying to do.

I tried many different syntaxes but couldn't get anything working.

Answer 1

Score: 0

Maybe you can use a custom lambda function:

from bs4 import BeautifulSoup

html_doc = '''\
<td class="sem_dados">I want this 1</td>
<td class="other">I don't want this</td>
<td>I want this 2</td>'''

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find_all('td', class_=lambda c: not c or 'sem_dados' == c))

Prints:

[<td class="sem_dados">I want this 1</td>, <td>I want this 2</td>]
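
Since the question cares about extraction order, here is a minimal sketch of two other ways to get both kinds of td in a single pass, in document order. It is not part of the original answer. The helper name wanted_cell and the variable names are hypothetical, and the select() call assumes a BeautifulSoup version that ships with the soupsieve CSS selector backend (4.7 or later).

```python
from bs4 import BeautifulSoup

html_doc = '''\
<td class="sem_dados">I want this 1</td>
<td class="other">I don't want this</td>
<td>I want this 2</td>'''

soup = BeautifulSoup(html_doc, 'html.parser')


def wanted_cell(tag):
    """Hypothetical helper: True for a td with no class attribute or with class sem_dados."""
    if tag.name != 'td':
        return False
    # When a class attribute is present, BeautifulSoup exposes it as a list of class names.
    return not tag.has_attr('class') or 'sem_dados' in tag['class']


# Option 1: pass a function that inspects the whole tag instead of filtering on class_.
by_function = soup.find_all(wanted_cell)

# Option 2: a CSS selector list; assumes soupsieve is installed alongside bs4.
by_selector = soup.select('td:not([class]), td.sem_dados')

texts_function = [td.get_text() for td in by_function]
texts_selector = [td.get_text() for td in by_selector]
print(texts_function)
print(texts_selector)
```

Both variants test for the absence of the class attribute itself rather than passing None inside a list, and both return the matching cells in the order they appear in the document, so no manual reordering should be needed.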

huangapple
  • Posted on February 27, 2023, 08:21:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75575860.html