英文:
Drop duplicate rows from DataFrame based on conditions on multiple columns
问题
以下是翻译好的部分:
我有如下的数据框:
| id | value | date |
| --- | ------|------------------ |
| 001 | True |01/01/2022 00:00:00|
| 002 | False |03/01/2022 00:00:00|
| 003 | True |03/01/2022 00:00:00|
| 001 | False |01/01/2022 01:30:00|
| 001 | True |01/01/2022 01:30:00|
| 002 | True |03/01/2022 00:00:00|
| 003 | True |03/01/2022 00:30:00|
| 004 | False |03/01/2022 00:30:00|
| 005 | False |01/01/2022 00:00:00|
在原始数据框中有一些重复的行,我想根据以下条件删除重复的行:
- 如果在**相同日期和相同时间存在重复的id**,则选择值为"True"的行(例如,id = 002)
- 如果存在**相同值的重复id**,则选择具有最新日期和时间的行(例如,id = 003)
- 如果存在**重复的id**,则选择具有最新日期和时间并且值为"True"的行(例如,id = 001)
预期的输出是:
| id | value | date |
| --- | ------|------------------ |
| 001 | True |01/01/2022 01:30:00|
| 002 | True |03/01/2022 00:00:00|
| 003 | True |03/01/2022 00:30:00|
| 004 | False |03/01/2022 00:30:00|
| 005 | False |01/01/2022 00:00:00|
有人可以建议我如何根据上述条件从数据框中删除重复项吗?
谢谢。
希望这能帮助你。
英文:
I have dataframe as follow:
id | value | date |
---|---|---|
001 | True | 01/01/2022 00:00:00 |
002 | False | 03/01/2022 00:00:00 |
003 | True | 03/01/2022 00:00:00 |
001 | False | 01/01/2022 01:30:00 |
001 | True | 01/01/2022 01:30:00 |
002 | True | 03/01/2022 00:00:00 |
003 | True | 03/01/2022 00:30:00 |
004 | False | 03/01/2022 00:30:00 |
005 | False | 01/01/2022 00:00:00 |
There are some duplicate rows in the raw dataframe and I would like to remove duplicate rows based on following conditions:
- If there are duplicate ids on the same date and same time, select a row with value "True" (e.g., id = 002)
- If there are duplicate ids with same value, select a row with the latest date and time (e.g., id == 003)
- If there are duplicate ids, select row with the latest date and time and select a row with value "True" (e.g., id == 001)
Expected output:
id | value | date |
---|---|---|
001 | True | 01/01/2022 01:30:00 |
002 | True | 03/01/2022 00:00:00 |
003 | True | 03/01/2022 00:30:00 |
004 | False | 03/01/2022 00:30:00 |
005 | False | 01/01/2022 00:00:00 |
Can somebody suggested me how to drop duplicates from dataframe based on above mentioned conditions ?
Thanks.
答案1
得分: 1
output = (
df.sort_values(by=['date', 'value'], ascending=False)
.drop_duplicates(subset='id')
.sort_values(by='id')
)
print(output)
Output
id value date
4 1 True 2022-01-01 01:30:00
5 2 True 2022-03-01 00:00:00
6 3 True 2022-03-01 00:30:00
7 4 False 2022-03-01 00:30:00
8 5 False 2022-01-01 00:00:00
英文:
It looks like perhaps you just need to sort your dataframe prior to dropping duplicates. Something like this:
output = (
df.sort_values(by=['date','value'], ascending=False)
.drop_duplicates(subset='id')
.sort_values(by='id')
)
print(output)
Output
id value date
4 1 True 2022-01-01 01:30:00
5 2 True 2022-03-01 00:00:00
6 3 True 2022-03-01 00:30:00
7 4 False 2022-03-01 00:30:00
8 5 False 2022-01-01 00:00:00
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论