Beautiful Soup - ignore `<span>` while providing `string` to `find()` method
Question
I am parsing some text in Python, using BeautifulSoup4.
The address block starts with a cell like this:
<td><strong>Address</strong></td>
I find the above cell using `soup.find("td", string="Address")`.
But, now some addresses have a highlight character too, like this:
<td><strong><span>*</span>Address</strong></td>
This has broken my matching. Is there still a way to find this TR?
Answer 1
Score: 1
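You can try either a CSS selector or a regular expression:
soup.select('td:has(strong:-soup-contains("Address"))')
or
import re
soup.find("td", string=re.compile("Address"))
Recent Soup Sieve releases deprecate `:contains` in favor of `:-soup-contains`, and `string=` is the current name for the older `text=` argument. Note that the regular-expression form is tested against the `.string` of the `<td>`, which is None once the `<span>` is added, so only the selector form matches the new markup.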
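To compare the two suggestions on both cell variants, here is a minimal sketch. The sample table and the variable names (`css_cells`, `regex_cells`) are made up for illustration, and it assumes a Soup Sieve version that supports `:-soup-contains`.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup: one plain cell, one with the highlight <span>.
html = (
    "<table>"
    "<tr><td><strong>Address</strong></td><td>1 Main St</td></tr>"
    "<tr><td><strong><span>*</span>Address</strong></td><td>2 Side Rd</td></tr>"
    "<tr><td><strong>Phone</strong></td><td>555-0100</td></tr>"
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")

# The selector looks at the full text of <strong>, descendants included,
# so the extra <span> does not get in the way.
css_cells = soup.select('td:has(strong:-soup-contains("Address"))')

# The string filter is compared with each <td>'s .string, which is None
# when <strong> holds more than one child.
regex_cells = soup.find_all("td", string=re.compile("Address"))

print([cell.get_text() for cell in css_cells])
print(len(regex_cells))
```

If the row is what you are after, `cell.parent` on a matched `<td>` gives the enclosing `<tr>`.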
Answer 2
Score: 0
I ended up with a solution like this:
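strong_blocks = soup.find_all("strong")

def common_block(tag):
    # Only look at direct children of <strong>, so text inside <span> is skipped
    return tag.find(string="Address", recursive=False)

# Keep the <strong> tags that have "Address" as a direct text node
address_texts = list(filter(common_block, strong_blocks))
if len(address_texts) == 1:
    address_text = address_texts[0]  # the matching <strong> tag
    address_cell = address_text.parent  # its enclosing <td>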
The trick was that once I had a list of `<strong>` elements, I was able to use `recursive=False` to prevent the `<span>` being inspected.
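The same idea can also be written as a single filter function passed to `find_all()`. This is only a sketch, not the answerer's code: the helper name `is_address_cell` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def is_address_cell(tag):
    """Return True for a <td> whose direct <strong> child has 'Address'
    as one of its own text nodes (text nested in <span> is ignored)."""
    if tag.name != "td":
        return False
    strong = tag.find("strong", recursive=False)
    if strong is None:
        return False
    # recursive=False limits the search to direct children of <strong>.
    return strong.find(string="Address", recursive=False) is not None


# Hypothetical sample markup covering both cell variants.
html = (
    "<table>"
    "<tr><td><strong>Address</strong></td><td>1 Main St</td></tr>"
    "<tr><td><strong><span>*</span>Address</strong></td><td>2 Side Rd</td></tr>"
    "<tr><td><strong>Phone</strong></td><td>555-0100</td></tr>"
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")

address_cells = soup.find_all(is_address_cell)
print([cell.get_text() for cell in address_cells])
```

Because the function receives every tag in the document, the `tag.name` check comes first to keep the filter cheap.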