Remove duplicated HTML elements

Question

I have this Python code snippet:

from bs4 import BeautifulSoup


with open("source.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')


z = soup.find_all('span', {'class' : 'contributor'})
print(z)
print("---")
print(type(z))

which prints this:

[<span class="contributor">Andrew</span>, 
<span class="contributor">John</span>,
<span class="contributor">Andrew</span>,
<span class="contributor">Maria</span>]
---
<class 'bs4.element.ResultSet'>


Note that type(z) is <class 'bs4.element.ResultSet'>, and the elements of z are not strings:
print(type(z[0])) returns <class 'bs4.element.Tag'>.

How can I remove the duplicated span elements entirely? (For example, Andrew appears twice.)

Answer 1

Score: 1

I assumed some very simple HTML content, but it worked: you can simply do a = list(set(z)) and it will remove the duplicates. I tested it:

from bs4 import BeautifulSoup

# Assume some very simple HTML
assume_html = """
<html>
    <span class="contributor">Andrew</span>
    <span class="contributor">John</span>
    <span class="contributor">Andrew</span>
    <span class="contributor">Maria</span>
</html>"""

soup = BeautifulSoup(assume_html, 'html.parser')

z = soup.find_all('span', {'class': 'contributor'})
for element in z:
    print("element ", element, type(element))
print("---")
print(type(z))
print("------")

a = list(set(z))
for new_element in a:
    print("new_element ", new_element, type(new_element))
print("---")
print(type(a))

Here is the output:

element  <span class="contributor">Andrew</span> <class 'bs4.element.Tag'>
element  <span class="contributor">John</span> <class 'bs4.element.Tag'>      
element  <span class="contributor">Andrew</span> <class 'bs4.element.Tag'>    
element  <span class="contributor">Maria</span> <class 'bs4.element.Tag'>     
---
<class 'bs4.element.ResultSet'>
------
new_element  <span class="contributor">Andrew</span> <class 'bs4.element.Tag'>
new_element  <span class="contributor">John</span> <class 'bs4.element.Tag'>  
new_element  <span class="contributor">Maria</span> <class 'bs4.element.Tag'> 
---
<class 'list'>
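
One caveat about list(set(z)): a set has no defined order, so the contributors are not guaranteed to come back in document order, even though they happen to in the output above. Here is a minimal sketch of an order-preserving alternative, assuming two tags count as duplicates when their rendered markup, str(tag), is identical. The helper name unique_in_order is hypothetical, and plain strings stand in for the Tag objects so the block runs without bs4 installed:

```python
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def unique_in_order(items: Iterable[T], key: Callable[[T], Hashable] = str) -> List[T]:
    """Return items with later duplicates dropped, keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        k = key(item)  # the value used to decide whether two items are "the same"
        if k not in seen:
            seen.add(k)
            result.append(item)
    return result


# Stand-in data (assumption): the rendered markup of the four spans from the question.
rendered = [
    '<span class="contributor">Andrew</span>',
    '<span class="contributor">John</span>',
    '<span class="contributor">Andrew</span>',
    '<span class="contributor">Maria</span>',
]

unique = unique_in_order(rendered)
print(unique)
```

With the z from the question, unique_in_order(z) should return a plain list of Tag objects in their original order, since the default key=str renders each tag before comparing.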

Answer 2

Score: 0

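You can loop through each element of z and use a set to keep track of the text you have already seen, keeping only the first element for each text. Avoid calling z.remove(el) inside the loop: removing from a list while iterating over it skips elements, and remove() drops the first element that compares equal, which is not necessarily the duplicate you are looking at.

import bs4

with open("source.html") as fp:
    soup = bs4.BeautifulSoup(fp, 'html.parser')

z = soup.find_all('span', {'class': 'contributor'})

used = set()
unique = []
for el in z:
    if el.text not in used:
        # First time this text appears: remember it and keep the element
        used.add(el.text)
        unique.append(el)

# Replace the contents in place so that z stays a ResultSet
z[:] = unique

print(z)
print(type(z))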

Alternatively, you can call el.decompose() on each duplicate. This destroys the element and removes it from the parsed document, but z itself still holds a reference to it, which shows up as an empty node:

import bs4


with open("source.html") as fp:
    soup = bs4.BeautifulSoup(fp, 'html.parser')


z = soup.find_all('span', {'class' : 'contributor'})

used = set()
for el in z:
    if el.text not in used:
        used.add(el.text)
    else:
        el.decompose()

print(z)
print(type(z))
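
Because decompose() acts on the parsed tree rather than on the list, one way to get a clean result afterwards is to query the soup again. The following is a minimal sketch of that idea, assuming the same four spans as in the question; the inline html string and the variable name remaining are illustrative stand-ins for source.html and are not part of the original answer:

```python
from bs4 import BeautifulSoup

# Assumed sample markup, mirroring the spans shown in the question.
html = """
<html>
    <span class="contributor">Andrew</span>
    <span class="contributor">John</span>
    <span class="contributor">Andrew</span>
    <span class="contributor">Maria</span>
</html>"""

soup = BeautifulSoup(html, "html.parser")
z = soup.find_all("span", {"class": "contributor"})

used = set()
for el in z:
    if el.text not in used:
        used.add(el.text)
    else:
        el.decompose()  # detaches the duplicate from the tree and destroys it

# z keeps all four entries; search the soup again for the surviving spans.
remaining = [el.get_text() for el in soup.find_all("span", {"class": "contributor"})]
print(len(z))
print(remaining)
```

This approach fits best when the goal is to clean the document itself, for example before writing the HTML back out. If only a deduplicated list is needed, filtering the list leaves the soup untouched.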

Posted by huangapple on May 13, 2023, 23:02:52
Original link: https://go.coder-hub.com/76243395.html