Remove duplicated HTML elements
Question
I have this Python code snippet:

    from bs4 import BeautifulSoup

    # Parse the local HTML file
    with open("source.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    # Collect every <span class="contributor"> element
    z = soup.find_all('span', {'class': 'contributor'})
    print(z)
    print("---")
    print(type(z))

which prints this:

    [<span class="contributor">Andrew</span>,
    <span class="contributor">John</span>,
    <span class="contributor">Andrew</span>,
    <span class="contributor">Maria</span>]
    ---
    <class 'bs4.element.ResultSet'>
Note that z is a <class 'bs4.element.ResultSet'>, not a plain list. Also, the elements of z are not strings:

    print(type(z[0]))

returns <class 'bs4.element.Tag'>.

Do you have any idea how I can remove the whole duplicated span element? (For example, the Andrew element appears twice.)
Answer 1
Score: 1
I used a very simple HTML sample as a stand-in, and it worked: you can simply do a = list(set(z)) and it will remove the duplicates. I tested it:

    from bs4 import BeautifulSoup

    # A simple stand-in HTML document
    assume_html = """
    <html>
    <span class="contributor">Andrew</span>
    <span class="contributor">John</span>
    <span class="contributor">Andrew</span>
    <span class="contributor">Maria</span>
    </html>"""

    soup = BeautifulSoup(assume_html, 'html.parser')
    z = soup.find_all('span', {'class': 'contributor'})

    # Show the original elements and their types
    for element in z:
        print("element ", element, type(element))
    print("---")
    print(type(z))
    print("------")

    # Converting to a set drops duplicates; the result is a plain list
    a = list(set(z))
    for new_element in a:
        print("new_element ", new_element, type(new_element))
    print("---")
    print(type(a))
Here is the output (a set is unordered, so the order of the deduplicated elements may differ from run to run):

    element <span class="contributor">Andrew</span> <class 'bs4.element.Tag'>
    element <span class="contributor">John</span> <class 'bs4.element.Tag'>
    element <span class="contributor">Andrew</span> <class 'bs4.element.Tag'>
    element <span class="contributor">Maria</span> <class 'bs4.element.Tag'>
    ---
    <class 'bs4.element.ResultSet'>
    ------
    new_element <span class="contributor">Andrew</span> <class 'bs4.element.Tag'>
    new_element <span class="contributor">John</span> <class 'bs4.element.Tag'>
    new_element <span class="contributor">Maria</span> <class 'bs4.element.Tag'>
    ---
    <class 'list'>
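If the original document order matters, one possible variation is to deduplicate through dictionary keys instead of a set. This is only a sketch: it assumes, as list(set(z)) does, that Tag objects hash and compare by their markup, and the names html, unique_z and names are made up for the illustration.

```python
from bs4 import BeautifulSoup

# Same stand-in HTML as in the answer above
html = """
<html>
<span class="contributor">Andrew</span>
<span class="contributor">John</span>
<span class="contributor">Andrew</span>
<span class="contributor">Maria</span>
</html>"""

soup = BeautifulSoup(html, "html.parser")
z = soup.find_all("span", {"class": "contributor"})

# Dict keys are unique and keep insertion order (Python 3.7+),
# so the first occurrence of each tag is kept, in document order.
unique_z = list(dict.fromkeys(z))

names = [tag.get_text() for tag in unique_z]
print(names)
```

Like list(set(z)), this returns a plain list rather than a ResultSet.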
Answer 2
Score: 0
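You can loop through each element of z and use a set to keep track of the text values that have already been seen, keeping only the first element for each text. Build the deduplicated list separately rather than calling z.remove(el) inside the loop: removing items from a list while iterating over it can skip elements, and remove() deletes the first element that compares equal, which is not necessarily the current one. Assigning the result back with z[:] keeps z a ResultSet:

    import bs4

    with open("source.html") as fp:
        soup = bs4.BeautifulSoup(fp, 'html.parser')

    z = soup.find_all('span', {'class': 'contributor'})

    used = set()    # text values seen so far
    unique = []     # first element for each text value
    for el in z:
        if el.text not in used:
            used.add(el.text)
            unique.append(el)

    z[:] = unique   # replace the contents in place; z is still a ResultSet
    print(z)
    print(type(z))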
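Removing items from a list while looping over it is a common pitfall with this kind of deduplication. The sketch below uses plain strings instead of bs4 tags to show the effect; both function names are hypothetical and exist only for this illustration.

```python
def dedupe_in_place_buggy(items):
    """Remove duplicates while iterating: shown only to illustrate the pitfall."""
    seen = set()
    for item in items:
        if item in seen:
            # Removing shifts later items left, so the loop skips the next one
            items.remove(item)
        else:
            seen.add(item)
    return items


def dedupe_keep_first(items):
    """Build a new list that keeps the first occurrence of each item."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


buggy = dedupe_in_place_buggy(["a", "a", "a", "b"])
fixed = dedupe_keep_first(["a", "a", "a", "b"])
print(buggy)  # a duplicate "a" survives
print(fixed)
```

With three consecutive duplicates, the in-place version leaves one duplicate behind, while the version that builds a new list does not.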
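Alternatively, you can call el.decompose() on each duplicate instead of removing it from z. This removes the element from the parsed tree (soup) and destroys it, but z itself still holds a reference to the destroyed element, which shows up there as an empty node:

    import bs4

    with open("source.html") as fp:
        soup = bs4.BeautifulSoup(fp, 'html.parser')

    z = soup.find_all('span', {'class': 'contributor'})

    used = set()    # text values seen so far
    for el in z:
        if el.text not in used:
            used.add(el.text)
        else:
            # Remove the duplicate from the tree and destroy it
            el.decompose()

    print(z)
    print(type(z))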
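Since decompose() changes the parsed tree rather than the list you are looping over, one way to get a clean result is to query the tree again afterwards. This is a minimal sketch using an inline HTML string as a stand-in for source.html; the names html, seen and remaining are assumptions for the illustration.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the contents of source.html
html = """
<html>
<span class="contributor">Andrew</span>
<span class="contributor">John</span>
<span class="contributor">Andrew</span>
<span class="contributor">Maria</span>
</html>"""

soup = BeautifulSoup(html, "html.parser")

seen = set()
# find_all returns its own list, so decomposing tags while looping over it
# does not change the sequence being iterated.
for el in soup.find_all("span", {"class": "contributor"}):
    text = el.get_text()
    if text in seen:
        el.decompose()  # drop the duplicate from the tree
    else:
        seen.add(text)

# Query the tree again instead of reusing the earlier result list
remaining = [el.get_text() for el in soup.find_all("span", {"class": "contributor"})]
print(remaining)
```

This approach fits when the goal is to clean up the document itself, for example before writing it back out with str(soup).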