Drop duplicate rows from DataFrame based on conditions on multiple columns

Question

I have a DataFrame as follows:

| id | value | date |
| --- | --- | --- |
| 001 | True | 01/01/2022 00:00:00 |
| 002 | False | 03/01/2022 00:00:00 |
| 003 | True | 03/01/2022 00:00:00 |
| 001 | False | 01/01/2022 01:30:00 |
| 001 | True | 01/01/2022 01:30:00 |
| 002 | True | 03/01/2022 00:00:00 |
| 003 | True | 03/01/2022 00:30:00 |
| 004 | False | 03/01/2022 00:30:00 |
| 005 | False | 01/01/2022 00:00:00 |

The raw DataFrame contains duplicate rows, and I would like to remove them based on the following conditions:

  • If an id is duplicated at the same date and time, keep the row whose value is "True" (e.g., id = 002)
  • If an id is duplicated with the same value, keep the row with the latest date and time (e.g., id = 003)
  • If an id is duplicated with different values and times, keep the row with the latest date and time, preferring value "True" (e.g., id = 001)

Expected output:

| id | value | date |
| --- | --- | --- |
| 001 | True | 01/01/2022 01:30:00 |
| 002 | True | 03/01/2022 00:00:00 |
| 003 | True | 03/01/2022 00:30:00 |
| 004 | False | 03/01/2022 00:30:00 |
| 005 | False | 01/01/2022 00:00:00 |

Can somebody suggest how to drop duplicates from the DataFrame based on the conditions above?

Thanks.

Answer 1

Score: 1

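It looks like you just need to sort your DataFrame before dropping duplicates. Sorting by date and then value in descending order puts the latest row first for each id, with True ahead of False when the timestamps tie, so drop_duplicates (which keeps the first occurrence by default) retains the row you want. Something like this: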

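# Assumes df['date'] is already a datetime column (e.g. via pd.to_datetime)
output = (
    df.sort_values(by=['date', 'value'], ascending=False)  # latest first; True before False on ties
      .drop_duplicates(subset='id')                        # keep the first row for each id
      .sort_values(by='id')                                # restore id order
)

print(output)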

Output

   id  value                date
4   1   True 2022-01-01 01:30:00
5   2   True 2022-03-01 00:00:00
6   3   True 2022-03-01 00:30:00
7   4  False 2022-03-01 00:30:00
8   5  False 2022-01-01 00:00:00
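
For anyone who wants to reproduce this end to end, here is a minimal sketch that rebuilds the sample data from the question and applies the same sort-then-drop approach. Two things in it are assumptions rather than something stated in the thread: the ids are kept as strings to preserve the leading zeros, and the dates are parsed as month/day/year, which is consistent with the output shown above.

```python
import pandas as pd

# Sample data from the question. Assumption: ids are strings, so "001" keeps its zeros.
df = pd.DataFrame({
    "id": ["001", "002", "003", "001", "001", "002", "003", "004", "005"],
    "value": [True, False, True, False, True, True, True, False, False],
    "date": [
        "01/01/2022 00:00:00", "03/01/2022 00:00:00", "03/01/2022 00:00:00",
        "01/01/2022 01:30:00", "01/01/2022 01:30:00", "03/01/2022 00:00:00",
        "03/01/2022 00:30:00", "03/01/2022 00:30:00", "01/01/2022 00:00:00",
    ],
})

# Assumption: month/day/year order. Use "%d/%m/%Y %H:%M:%S" if the data is day-first.
# Parsing matters: sorting the raw strings would compare text, not timestamps.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y %H:%M:%S")

output = (
    df.sort_values(by=["date", "value"], ascending=False)  # latest first; True before False on ties
      .drop_duplicates(subset="id")                        # keep the first row for each id
      .sort_values(by="id")                                # restore id order
)

print(output)
```

Note that this always keeps a row with the latest timestamp for each id. If an id's latest row were False while an older row were True, the False row would be kept, so check that this matches how you read the third condition.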

huangapple
  • Posted on February 14, 2023 at 22:10:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75449043.html