March 31, 2023 17:19:44

How to drop duplicates from a pandas DataFrame under some complex conditions?

Question

I am trying to drop duplicates, but based on some conditions. My dataframe looks like this:

idx  a    b    c   d   e  f
1    1   ss1   0   25  A  B
2    3   ss7   0   25  A  B
3    5   ss5   0   12  C  D
4    11  im3   0   12  C  D
5    5   ss8   0   50  C  K
6    9   im8   0   5   F  G
7    8   ix6   0   5   F  G

Rows are considered duplicates if their values in columns d, e and f together match those of another row in the dataframe, i.e. subset=['d', 'e', 'f']. For example, rows 1 and 2 are duplicates, rows 3 and 4 are duplicates, and rows 6 and 7 are duplicates. Which row to drop is decided by column b.

If the value in column b begins with ss for both duplicates (rows 1 and 2), then either one can be dropped.
If one of the duplicates begins with ss and the other begins with something else (rows 3 and 4), then the one that begins with ss should be kept.
If neither duplicate begins with ss in column b (rows 6 and 7), then either one can be kept.

Therefore, the expected output should be something like this:

idx  a    b    c   d   e  f
2    3   ss7   0   25  A  B
3    5   ss5   0   12  C  D
5    5   ss8   0   50  C  K
7    8   ix6   0   5   F  G
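
For anyone who wants to experiment with the example, here is a minimal sketch that loads the sample table above into a DataFrame and flags the rows that belong to a duplicate group. The names raw and dup_mask are only illustrative, and parsing with a whitespace separator assumes no cell contains spaces.

```python
import io

import pandas as pd

# Sample table from the question, pasted as whitespace-separated text.
raw = """\
idx  a    b    c   d   e  f
1    1   ss1   0   25  A  B
2    3   ss7   0   25  A  B
3    5   ss5   0   12  C  D
4    11  im3   0   12  C  D
5    5   ss8   0   50  C  K
6    9   im8   0   5   F  G
7    8   ix6   0   5   F  G
"""

# sep=r"\s+" splits on any run of whitespace (assumes no spaces inside values).
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# keep=False marks every member of a duplicate group, not only the later ones.
dup_mask = df.duplicated(subset=["d", "e", "f"], keep=False)
print(df.loc[dup_mask, "idx"].tolist())  # [1, 2, 3, 4, 6, 7]
```

Row 5 is the only row that has no duplicate on (d, e, f), so it is never a candidate for dropping.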

Answer 1

Score: 3

Sort by column b using a boolean key first (every row whose b starts with 'ss' is moved to the end), then drop duplicates on ['d', 'e', 'f'], keeping the last row of each group:

out = (df.sort_values('b', key=lambda x: x.str.startswith('ss'))
         .drop_duplicates(['d', 'e', 'f'], keep='last').sort_index())
# OR
out = (df.sort_values('b', key=lambda x: x.str.startswith('ss'))
         .groupby(['d', 'e', 'f'], as_index=False).nth(-1).sort_index())

Output:

>>> out
   idx  a    b  c   d  e  f
1    2  3  ss7  0  25  A  B
2    3  5  ss5  0  12  C  D
4    5  5  ss8  0  50  C  K
6    7  8  ix6  0   5  F  G
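
The reason this works is that the key turns column b into booleans, and False sorts before True, so 'ss' rows land after the other rows and keep='last' picks them. Below is a self-contained sketch of the same idea wrapped in a helper. The function name drop_duplicates_prefer_prefix and its parameters are hypothetical, and kind='stable' and na=False are additions not present in the answer (they make ties deterministic and treat missing values in b as non-matching).

```python
import pandas as pd


def drop_duplicates_prefer_prefix(df, subset, col="b", prefix="ss"):
    """Keep one row per `subset` group, preferring rows whose `col` starts with `prefix`.

    Hypothetical helper, not part of pandas.
    """
    ordered = df.sort_values(
        col,
        # Boolean key: False (no prefix) sorts before True (has prefix),
        # so preferred rows end up last within the frame.
        key=lambda s: s.str.startswith(prefix, na=False),
        kind="stable",  # assumption: keep original order among equal keys
    )
    # keep="last" retains the prefixed row whenever a group has one.
    return ordered.drop_duplicates(subset, keep="last").sort_index()


df = pd.DataFrame({
    "idx": [1, 2, 3, 4, 5, 6, 7],
    "a": [1, 3, 5, 11, 5, 9, 8],
    "b": ["ss1", "ss7", "ss5", "im3", "ss8", "im8", "ix6"],
    "c": [0] * 7,
    "d": [25, 25, 12, 12, 50, 5, 5],
    "e": ["A", "A", "C", "C", "C", "F", "F"],
    "f": ["B", "B", "D", "D", "K", "G", "G"],
})

out = drop_duplicates_prefer_prefix(df, ["d", "e", "f"])
print(out["idx"].tolist())  # [2, 3, 5, 7]
```

With the default sort kind, which row survives among tied rows (for example rows 1 and 2) is not guaranteed in general, which is fine here because the question allows either one to be kept.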
