Beautiful Soup - ignore `<span>` while providing `string` to `find()` method

Question

I am parsing some text in Python, using BeautifulSoup4.

The address block starts with a cell like this:

    <td><strong>Address</strong></td>

I find the above cell using `soup.find("td", string="Address")`.

But, now some addresses have a highlight character too, like this:

    <td><strong><span>*</span>Address</strong></td>

This has broken my matching. Is there still a way to find this TR?

Answer 1

Score: 1

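You can try either a CSS selector or a regular expression:

    soup.select('td:has(strong:-soup-contains("Address"))')

or

    import re
    soup.find("td", string=re.compile("Address"))

In current releases `:-soup-contains` replaces the deprecated `:contains`, and `string=` replaces the deprecated `text=`. As far as I can tell, the regular-expression form compares the pattern with the cell's `.string`, which is `None` once `<strong>` holds both a `<span>` and text, so it is likely to match only the cell without the highlight. The CSS selector looks at the full text of `<strong>` and should match both.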
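
To check the selector against both versions of the cell, here is a minimal sketch. The sample markup and the names `PLAIN`, `HIGHLIGHTED`, and `find_address_cells` are made up for illustration, and it assumes a soupsieve version that supports `:-soup-contains` (older releases spell it `:contains`).

```python
from bs4 import BeautifulSoup

# Hypothetical sample rows modelled on the two cells in the question.
PLAIN = "<table><tr><td><strong>Address</strong></td><td>1 Main St</td></tr></table>"
HIGHLIGHTED = (
    "<table><tr><td><strong><span>*</span>Address</strong></td>"
    "<td>2 Side St</td></tr></table>"
)


def find_address_cells(html):
    """Return every <td> containing a <strong> whose text includes "Address"."""
    soup = BeautifulSoup(html, "html.parser")
    # :-soup-contains checks the element's full text, including the <span>.
    return soup.select('td:has(strong:-soup-contains("Address"))')


for html in (PLAIN, HIGHLIGHTED):
    cells = find_address_cells(html)
    print(len(cells), cells[0].get_text())
```

Note that this is a substring match, so a label such as "Email Address" would also be selected; tighten the selector if the table has similar labels.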

Answer 2

Score: 0

I ended up with a solution like this:

    strong_blocks = soup.find_all("strong")

    def common_block(tag):
        # Look only at direct children of <strong>, so text inside <span> is skipped.
        return tag.find(string="Address", recursive=False)

    # filter() keeps the <strong> tags for which common_block() found a match.
    address_texts = list(filter(common_block, strong_blocks))
    if len(address_texts) == 1:
        address_text = address_texts[0]
        address_cell = address_text.parent  # the enclosing <td>

The trick was that once I had a list of `<strong>` elements, I could use `recursive=False` to prevent the `<span>` from being inspected.
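
The same idea can be wrapped in a small helper. This is only a sketch: `find_label_cell` and the sample markup are hypothetical names and data, not part of the original solution.

```python
from bs4 import BeautifulSoup


def find_label_cell(soup, label="Address"):
    """Return the parent of the single <strong> that directly holds `label`, else None."""
    matches = [
        strong
        for strong in soup.find_all("strong")
        # recursive=False restricts the search to direct children of <strong>,
        # so the "*" inside <span> is never compared with the label.
        if strong.find(string=label, recursive=False) is not None
    ]
    return matches[0].parent if len(matches) == 1 else None


# Hypothetical markup based on the highlighted cell from the question.
html = (
    "<table><tr><td><strong><span>*</span>Address</strong></td>"
    "<td>2 Side St</td></tr></table>"
)
cell = find_label_cell(BeautifulSoup(html, "html.parser"))
print(cell.name)
```

Returning `None` when there is no match, or more than one, keeps the "exactly one match" check from the snippet above explicit for the caller.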

huangapple
  • Posted on April 7, 2023, 00:04:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75951558.html