May 13, 2023, 23:02:52

Remove duplicated HTML elements

Question

I have this Python code snippet:

from bs4 import BeautifulSoup


with open("source.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')


z = soup.find_all('span', {'class': 'contributor'})
print(z)
print("---")
print(type(z))

which prints this:

[<span class="contributor">Andrew</span>,
<span class="contributor">John</span>,
<span class="contributor">Andrew</span>,
<span class="contributor">Maria</span>]
---
<class 'bs4.element.ResultSet'>

Note that the type of z is <class 'bs4.element.ResultSet'>.
Also, the elements of z are not strings:
print(type(z[0])) returns <class 'bs4.element.Tag'>.

Do you have any idea how I can remove the whole duplicated span element? (For example, Andrew appears twice.)

Answer 1

Score: 1

I tested this on a deliberately simple HTML document and it worked: you can simply do a = list(set(z)) to remove the duplicates. Note that a set does not preserve order, so the elements of a are not guaranteed to come out in the same order as in z.

from bs4 import BeautifulSoup

# A deliberately simple HTML document for testing
assume_html = """
<html>
    <span class="contributor">Andrew</span>
    <span class="contributor">John</span>
    <span class="contributor">Andrew</span>
    <span class="contributor">Maria</span>
</html>"""

soup = BeautifulSoup(assume_html, 'html.parser')

z = soup.find_all('span', {'class': 'contributor'})
for element in z:
    print("element ", element, type(element))
print("---")
print(type(z))
print("------")

# Converting to a set drops the duplicates; list() turns it back into a list
a = list(set(z))
for new_element in a:
    print("new_element ", new_element, type(new_element))
print("---")
print(type(a))

Here is the output:

element  <span class="contributor">Andrew</span> <class 'bs4.element.Tag'>
element  <span class="contributor">John</span> <class 'bs4.element.Tag'>
element  <span class="contributor">Andrew</span> <class 'bs4.element.Tag'>
element  <span class="contributor">Maria</span> <class 'bs4.element.Tag'>
---
<class 'bs4.element.ResultSet'>
------
new_element  <span class="contributor">Andrew</span> <class 'bs4.element.Tag'>
new_element  <span class="contributor">John</span> <class 'bs4.element.Tag'>
new_element  <span class="contributor">Maria</span> <class 'bs4.element.Tag'>
---
<class 'list'>
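
If the original order of the contributors matters, one option is to deduplicate on an explicit key instead of relying on a set of Tag objects. The following is only a sketch: the helper name unique_in_order is made up for illustration, and it assumes that two spans count as duplicates when their serialized markup (str(el)) is identical.

```python
from bs4 import BeautifulSoup

html = """
<html>
    <span class="contributor">Andrew</span>
    <span class="contributor">John</span>
    <span class="contributor">Andrew</span>
    <span class="contributor">Maria</span>
</html>"""

soup = BeautifulSoup(html, "html.parser")
z = soup.find_all("span", {"class": "contributor"})


def unique_in_order(elements, key=str):
    """Hypothetical helper: keep the first element for each key, in order."""
    seen = set()
    result = []
    for el in elements:
        k = key(el)  # by default, the element's serialized markup
        if k not in seen:
            seen.add(k)
            result.append(el)
    return result


unique_z = unique_in_order(z)
print([el.get_text() for el in unique_z])  # ['Andrew', 'John', 'Maria']
```

Passing key=lambda el: el.get_text(strip=True) would instead treat spans with the same text as duplicates even if their attributes differ.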

Answer 2

Score: 0

You can loop through each element of z and use a set to keep track of the text you have already seen. Keep an element only the first time its text appears, and collect the kept elements in a new list. Calling z.remove(el) inside the loop is unsafe, because removing from a list while iterating over it makes the loop skip elements:

import bs4


with open("source.html") as fp:
    soup = bs4.BeautifulSoup(fp, 'html.parser')


z = soup.find_all('span', {'class': 'contributor'})

used = set()    # texts seen so far
unique = []     # first occurrence of each text, in original order
for el in z:
    if el.text not in used:
        used.add(el.text)
        unique.append(el)

print(unique)
print(type(unique))
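
As a side note on removing items from a list while looping over it: Python's list iterator advances by index, so after a removal the next item slides into the current slot and is never visited. A small sketch with plain strings (the list names is invented for illustration and stands in for the element texts) shows the effect next to the safer approach of building a new list:

```python
names = ["Andrew", "Andrew", "Andrew", "John"]

# Unsafe: mutate the list that is being iterated.
buggy = list(names)
seen = set()
for name in buggy:
    if name in seen:
        buggy.remove(name)  # shifts later items left; the next one is skipped
    else:
        seen.add(name)

# Safe: leave the input alone and collect the first occurrences.
seen = set()
fixed = []
for name in names:
    if name not in seen:
        seen.add(name)
        fixed.append(name)

print(buggy)  # ['Andrew', 'Andrew', 'John'], a duplicate survives
print(fixed)  # ['Andrew', 'John']
```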

Alternatively, you can call el.decompose() on each duplicate. This destroys the element and removes it from soup, but z still holds the decomposed Tag, so it shows up there as an empty node:

import bs4


with open("source.html") as fp:
    soup = bs4.BeautifulSoup(fp, 'html.parser')


z = soup.find_all('span', {'class': 'contributor'})

used = set()    # texts seen so far
for el in z:
    if el.text not in used:
        used.add(el.text)
    else:
        # Duplicate: remove it from the parse tree and destroy it
        el.decompose()

print(z)
print(type(z))
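
To see the effect of decompose() on the parse tree itself, here is a minimal sketch. It uses an inline HTML string (an assumption, standing in for source.html), decomposes every repeated span, and then queries soup again, since the list returned by the earlier find_all call is not updated.

```python
from bs4 import BeautifulSoup

html = """
<html>
    <span class="contributor">Andrew</span>
    <span class="contributor">John</span>
    <span class="contributor">Andrew</span>
    <span class="contributor">Maria</span>
</html>"""

soup = BeautifulSoup(html, "html.parser")

seen = set()
for el in soup.find_all("span", class_="contributor"):
    text = el.get_text(strip=True)
    if text in seen:
        el.decompose()  # remove the duplicate from the tree
    else:
        seen.add(text)

# Query the tree again to get only the spans that are still in it.
remaining = [el.get_text() for el in soup.find_all("span", class_="contributor")]
print(remaining)  # ['Andrew', 'John', 'Maria']
```

This variant is useful when the goal is a cleaned-up document (for example, to write soup back out) rather than just a deduplicated Python list.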
