February 14, 2023, 22:10:41

Drop duplicate rows from DataFrame based on conditions on multiple columns

Question

I have a dataframe as follows:

| id  | value | date                |
| --- | ----- | ------------------- |
| 001 | True  | 01/01/2022 00:00:00 |
| 002 | False | 03/01/2022 00:00:00 |
| 003 | True  | 03/01/2022 00:00:00 |
| 001 | False | 01/01/2022 01:30:00 |
| 001 | True  | 01/01/2022 01:30:00 |
| 002 | True  | 03/01/2022 00:00:00 |
| 003 | True  | 03/01/2022 00:30:00 |
| 004 | False | 03/01/2022 00:30:00 |
| 005 | False | 01/01/2022 00:00:00 |

There are some duplicate rows in the raw dataframe, and I would like to remove them based on the following conditions:

 - If there are duplicate ids with the same date and time, select the row with value "True" (e.g., id = 002)
 - If there are duplicate ids with the same value, select the row with the latest date and time (e.g., id = 003)
 - If there are duplicate ids, select the row with the latest date and time and with value "True" (e.g., id = 001)

Expected output:

| id  | value | date                |
| --- | ----- | ------------------- |
| 001 | True  | 01/01/2022 01:30:00 |
| 002 | True  | 03/01/2022 00:00:00 |
| 003 | True  | 03/01/2022 00:30:00 |
| 004 | False | 03/01/2022 00:30:00 |
| 005 | False | 01/01/2022 00:00:00 |

Can somebody suggest how to drop duplicates from the dataframe based on the conditions above?

Thanks.

Answer 1

Score: 1

It looks like perhaps you just need to sort your dataframe prior to dropping duplicates. Something like this:

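# Assumes df['date'] is already a datetime column (e.g. via pd.to_datetime).
# Sort newest first, with True ahead of False on ties, then keep the first row per id.
output = (
    df.sort_values(by=['date', 'value'], ascending=False)
    .drop_duplicates(subset='id')
    .sort_values(by='id')
)

print(output)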

Output

   id  value                date
4   1   True 2022-01-01 01:30:00
5   2   True 2022-03-01 00:00:00
6   3   True 2022-03-01 00:30:00
7   4  False 2022-03-01 00:30:00
8   5  False 2022-01-01 00:00:00
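
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same sort-then-drop approach, with the sample data from the question rebuilt by hand. Two things in it are assumptions rather than anything stated in the question: the ids are stored as strings (to keep the leading zeros), and the dates are read as month/day/year, which matches the output shown above.

```python
import pandas as pd

# Sample data copied from the question; ids kept as strings (assumption)
# so that the leading zeros survive.
df = pd.DataFrame(
    {
        "id": ["001", "002", "003", "001", "001", "002", "003", "004", "005"],
        "value": [True, False, True, False, True, True, True, False, False],
        "date": [
            "01/01/2022 00:00:00",
            "03/01/2022 00:00:00",
            "03/01/2022 00:00:00",
            "01/01/2022 01:30:00",
            "01/01/2022 01:30:00",
            "03/01/2022 00:00:00",
            "03/01/2022 00:30:00",
            "03/01/2022 00:30:00",
            "01/01/2022 00:00:00",
        ],
    }
)

# Assumption: dates are month/day/year. Change the format string if the
# real data is day/month/year. Parsing matters: sorting the raw strings
# would compare them character by character, not chronologically.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y %H:%M:%S")

output = (
    # Newest rows first; on equal timestamps True sorts ahead of False.
    df.sort_values(by=["date", "value"], ascending=False)
    # keep="first" is the default, so the top row of each id survives.
    .drop_duplicates(subset="id")
    .sort_values(by="id")
)

print(output)
```

Note the precedence this encodes: the latest timestamp wins first, and `value` only breaks ties between rows with the same timestamp. If an id had an older row with True and a newer row with False, this would keep the newer False row; swapping the order of the sort keys would give the opposite priority.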
