根据多列的条件从DataFrame中删除重复行

huangapple go评论64阅读模式
英文:

Drop duplicate rows from DataFrame based on conditions on multiple columns

问题

以下是翻译好的部分:

我有如下的数据框

| id  | value | date              |
| --- | ------|------------------ | 
| 001 | True  |01/01/2022 00:00:00|
| 002 | False |03/01/2022 00:00:00|
| 003 | True  |03/01/2022 00:00:00|
| 001 | False |01/01/2022 01:30:00|
| 001 | True  |01/01/2022 01:30:00|
| 002 | True  |03/01/2022 00:00:00|
| 003 | True  |03/01/2022 00:30:00|
| 004 | False |03/01/2022 00:30:00|
| 005 | False |01/01/2022 00:00:00|

在原始数据框中有一些重复的行我想根据以下条件删除重复的行

 - 如果在**相同日期和相同时间存在重复的id**则选择值为"True"的行例如id = 002
 - 如果存在**相同值的重复id**则选择具有最新日期和时间的行例如id = 003
 - 如果存在**重复的id**则选择具有最新日期和时间并且值为"True"的行例如id = 001

预期的输出是

| id  | value | date              |
| --- | ------|------------------ |
| 001 | True  |01/01/2022 01:30:00|
| 002 | True  |03/01/2022 00:00:00|
| 003 | True  |03/01/2022 00:30:00|
| 004 | False |03/01/2022 00:30:00|
| 005 | False |01/01/2022 00:00:00|

有人可以建议我如何根据上述条件从数据框中删除重复项吗

谢谢

希望这能帮助你。

英文:

I have dataframe as follow:

id value date
001 True 01/01/2022 00:00:00
002 False 03/01/2022 00:00:00
003 True 03/01/2022 00:00:00
001 False 01/01/2022 01:30:00
001 True 01/01/2022 01:30:00
002 True 03/01/2022 00:00:00
003 True 03/01/2022 00:30:00
004 False 03/01/2022 00:30:00
005 False 01/01/2022 00:00:00

There are some duplicate rows in the raw dataframe and I would like to remove duplicate rows based on following conditions:

  • If there are duplicate ids on the same date and same time, select a row with value "True" (e.g., id = 002)
  • If there are duplicate ids with same value, select a row with the latest date and time (e.g., id == 003)
  • If there are duplicate ids, select row with the latest date and time and select a row with value "True" (e.g., id == 001)

Expected output:

id value date
001 True 01/01/2022 01:30:00
002 True 03/01/2022 00:00:00
003 True 03/01/2022 00:30:00
004 False 03/01/2022 00:30:00
005 False 01/01/2022 00:00:00

Can somebody suggested me how to drop duplicates from dataframe based on above mentioned conditions ?

Thanks.

答案1

得分: 1

output = (
df.sort_values(by=['date', 'value'], ascending=False)
.drop_duplicates(subset='id')
.sort_values(by='id')
)

print(output)

Output

   id  value                date
4   1   True 2022-01-01 01:30:00
5   2   True 2022-03-01 00:00:00
6   3   True 2022-03-01 00:30:00
7   4  False 2022-03-01 00:30:00
8   5  False 2022-01-01 00:00:00
英文:

It looks like perhaps you just need to sort your dataframe prior to dropping duplicates. Something like this:

output =    (
df.sort_values(by=['date','value'], ascending=False)
.drop_duplicates(subset='id')
.sort_values(by='id')
)

print(output)

Output

   id  value                date
4   1   True 2022-01-01 01:30:00
5   2   True 2022-03-01 00:00:00
6   3   True 2022-03-01 00:30:00
7   4  False 2022-03-01 00:30:00
8   5  False 2022-01-01 00:00:00

huangapple
  • 本文由 发表于 2023年2月14日 22:10:41
  • 转载请务必保留本文链接:https://go.coder-hub.com/75449043.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定