Remove duplicates in a Pandas data frame based on a column
Question
I have the data set below, and I want to remove consecutive duplicates based on the bool column.
datetime | number | bool |
---|---|---|
2023-02-14 10:15:00 | 195.35 | FALSE |
2023-02-14 11:15:00 | 195.8 | FALSE |
2023-02-14 12:15:00 | 195.87 | FALSE |
2023-02-14 13:15:00 | 196.06 | FALSE |
2023-02-14 14:15:00 | 195.97 | TRUE |
2023-02-14 15:15:00 | 195.98 | TRUE |
2023-02-15 09:15:00 | 196.23 | FALSE |
2023-02-15 10:15:00 | 196.3 | FALSE |
2023-02-15 11:15:00 | 196.26 | TRUE |
2023-02-15 12:15:00 | 196.4 | TRUE |
2023-02-15 13:15:00 | 196.28 | TRUE |
2023-02-15 14:15:00 | 197.14 | FALSE |
2023-02-15 15:15:00 | 197.08 | FALSE |
2023-02-16 09:15:00 | 197.85 | TRUE |
2023-02-16 10:15:00 | 198.01 | TRUE |
The resulting data should look like this:
datetime | number | bool |
---|---|---|
2023-02-14 10:15:00 | 195.35 | FALSE |
2023-02-14 14:15:00 | 195.97 | TRUE |
2023-02-15 09:15:00 | 196.23 | FALSE |
2023-02-15 11:15:00 | 196.26 | TRUE |
2023-02-15 14:15:00 | 197.14 | FALSE |
2023-02-16 09:15:00 | 197.85 | TRUE |
I tried pandas drop_duplicates, but it looks at the whole bool column at once and removes every repeated value, which leaves only 2 rows.
PS: I could just loop through all rows and compare each one with the previous row, but I am looking for a native pandas way of doing this, if one exists.
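To make the problem reproducible, here is a minimal sketch that rebuilds the sample data and shows the drop_duplicates behaviour described above. The frame construction and the variable names `df` and `deduped` are assumptions for illustration; they are not part of the original question.

```python
import pandas as pd

# Sample data copied from the question; the construction itself is illustrative.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2023-02-14 10:15:00", "2023-02-14 11:15:00", "2023-02-14 12:15:00",
        "2023-02-14 13:15:00", "2023-02-14 14:15:00", "2023-02-14 15:15:00",
        "2023-02-15 09:15:00", "2023-02-15 10:15:00", "2023-02-15 11:15:00",
        "2023-02-15 12:15:00", "2023-02-15 13:15:00", "2023-02-15 14:15:00",
        "2023-02-15 15:15:00", "2023-02-16 09:15:00", "2023-02-16 10:15:00",
    ]),
    "number": [195.35, 195.8, 195.87, 196.06, 195.97, 195.98, 196.23, 196.3,
               196.26, 196.4, 196.28, 197.14, 197.08, 197.85, 198.01],
    "bool": [False, False, False, False, True, True, False, False,
             True, True, True, False, False, True, True],
})

# drop_duplicates compares values across the whole column, not between
# neighbouring rows, so only the first False row and the first True row remain.
deduped = df.drop_duplicates(subset="bool")
print(deduped)
```

Only two rows survive because the column holds just two distinct values, whereas the desired result keeps the first row of every consecutive run.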
Answer 1
Score: 1
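You can use boolean indexing, comparing each value with the previous (shifted) one via Series.ne and Series.shift:
# Keep a row only when its bool value differs from the previous row's value
out = df[df['bool'].ne(df['bool'].shift())]
print(out)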
               datetime  number   bool
0   2023-02-14 10:15:00  195.35  False
4   2023-02-14 14:15:00  195.97   True
6   2023-02-15 09:15:00  196.23  False
8   2023-02-15 11:15:00  196.26   True
11  2023-02-15 14:15:00  197.14  False
13  2023-02-16 09:15:00  197.85   True
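To see why this works, it may help to look at the intermediate mask. The sketch below uses only the bool column from the question; the names `prev`, `is_run_start`, `is_run_end`, and `last_of_each_run` are hypothetical, and the "keep the last row of each run" variation is an illustrative extension rather than something the question asked for.

```python
import pandas as pd

# Only the bool column from the question is needed to illustrate the mask.
df = pd.DataFrame({
    "bool": [False, False, False, False, True, True, False, False,
             True, True, True, False, False, True, True],
})

# shift() moves every value down one row; the first row gets NaN.
prev = df["bool"].shift()

# ne() is an element-wise "not equal": True where the value changes,
# and True for row 0 because nothing compares equal to NaN.
is_run_start = df["bool"].ne(prev)

out = df[is_run_start]
print(out.index.tolist())  # [0, 4, 6, 8, 11, 13]

# Illustrative variation: compare with the NEXT row to keep the last row of each run.
is_run_end = df["bool"].ne(df["bool"].shift(-1))
last_of_each_run = df[is_run_end]
print(last_of_each_run.index.tolist())  # [3, 5, 7, 10, 12, 14]
```

The first row is always kept because its shifted neighbour is missing, which is usually what is wanted when collapsing runs.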
Answer 2
Score: 1
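What if you use [duplicated](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html)?
# Flag rows whose (datetime, number) pair occurs again later and whose bool is True
without_duplicates = df.duplicated(['datetime', 'number'], keep='last') & df['bool']
# Invert the mask to keep every row that was not flagged
print(df[~without_duplicates])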
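A quick check of this approach, assuming the first six rows of the sample data from the question, suggests it does not collapse consecutive runs: duplicated only flags rows whose datetime and number pair occurs again, and every pair in the sample is unique. The variable name `flagged` is an assumption for illustration.

```python
import pandas as pd

# First six rows of the question's sample data (illustrative subset).
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2023-02-14 10:15:00", "2023-02-14 11:15:00", "2023-02-14 12:15:00",
        "2023-02-14 13:15:00", "2023-02-14 14:15:00", "2023-02-14 15:15:00",
    ]),
    "number": [195.35, 195.8, 195.87, 196.06, 195.97, 195.98],
    "bool": [False, False, False, False, True, True],
})

# A row is flagged only if the same (datetime, number) pair occurs again later
# AND its bool value is True.
flagged = df.duplicated(["datetime", "number"], keep="last") & df["bool"]

print(flagged.any())      # False: no pair repeats, so nothing is flagged
print(len(df[~flagged]))  # 6: every row is kept
```

On data where the datetime and number pairs are all distinct, the mask stays all False, so comparing each row with its shifted neighbour is the more direct fit for removing consecutive repeats.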