BeautifulSoup -- extracting both "td" objects without class (class_ = None or False) and other class types
Question
I am trying to scrape a website that has td objects. Some of them have no class, and I can extract those with
object.find_all("td", class_=None)
Others have a class called sem_dados, which I can extract with
object.find_all("td", class_="sem_dados")
The main issue is that I can't do both at the same time. For instance,
object.find_all("td", class_=[None, "sem_dados"])
does not return the td objects that have no class. This seems to be a problem with how None or False behaves inside a list, since
object.find_all("td", class_=[None])
also returns an empty list.
Does anyone know how to change the syntax so that I can match both in one call? The order of the extracted elements matters. I could reorder them manually, but I believe there must be a syntax for what I am trying to do.
I have tried many different syntaxes but could not get anything to work.
Answer 1
Score: 0
One option is to pass a custom lambda function as the class_ filter:
from bs4 import BeautifulSoup

html_doc = '''\
<td class="sem_dados">I want this 1</td>
<td class="other">I don't want this</td>
<td>I want this 2</td>'''

soup = BeautifulSoup(html_doc, 'html.parser')

# c is the class value being tested; it is None when the td has no class attribute
print(soup.find_all('td', class_=lambda c: not c or 'sem_dados' == c))
Prints:
[<td class="sem_dados">I want this 1</td>, <td>I want this 2</td>]
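A possible variation, sketched here rather than taken from the answer, is to pass a function that inspects the whole tag instead of only the class value. The helper name no_class_or_sem_dados and the extra fourth td in the sample HTML are made up for illustration. Because find_all walks the document once, the matches come back in document order, which is what the question asks for.

```python
from bs4 import BeautifulSoup

# Sample markup; the fourth cell is an illustrative addition, not from the question.
html_doc = """\
<table><tr>
<td class="sem_dados">I want this 1</td>
<td class="other">I don't want this</td>
<td>I want this 2</td>
<td class="sem_dados extra">I want this 3</td>
</tr></table>"""

soup = BeautifulSoup(html_doc, "html.parser")


def no_class_or_sem_dados(tag):
    """Hypothetical helper: match td tags with no class or with the sem_dados class."""
    if tag.name != "td":
        return False
    classes = tag.get("class")  # list of class names, or None if the attribute is absent
    return not classes or "sem_dados" in classes


cells = soup.find_all(no_class_or_sem_dados)
texts = [td.get_text() for td in cells]
print(texts)
```

Putting the condition in a named function keeps it readable if more classes need to be accepted later.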