Remove duplicates in a Pandas data frame based on a column

Question

I have the dataset below, and I want to remove duplicates based on the bool column: within each run of consecutive identical bool values, only the first row should be kept.

datetime number bool
2023-02-14 10:15:00 195.35 FALSE
2023-02-14 11:15:00 195.8 FALSE
2023-02-14 12:15:00 195.87 FALSE
2023-02-14 13:15:00 196.06 FALSE
2023-02-14 14:15:00 195.97 TRUE
2023-02-14 15:15:00 195.98 TRUE
2023-02-15 09:15:00 196.23 FALSE
2023-02-15 10:15:00 196.3 FALSE
2023-02-15 11:15:00 196.26 TRUE
2023-02-15 12:15:00 196.4 TRUE
2023-02-15 13:15:00 196.28 TRUE
2023-02-15 14:15:00 197.14 FALSE
2023-02-15 15:15:00 197.08 FALSE
2023-02-16 09:15:00 197.85 TRUE
2023-02-16 10:15:00 198.01 TRUE

The resulting data should look like this:

datetime number bool
2023-02-14 10:15:00 195.35 FALSE
2023-02-14 14:15:00 195.97 TRUE
2023-02-15 09:15:00 196.23 FALSE
2023-02-15 11:15:00 196.26 TRUE
2023-02-15 14:15:00 197.14 FALSE
2023-02-16 09:15:00 197.85 TRUE

I tried pandas drop_duplicates, but it considers the whole bool column at once and removes every repeated value, which leaves only 2 rows.

PS: I could just loop through all rows and compare each one to the previous row, but I am looking for a pandas-native way of doing this, if one exists.
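To make the problem concrete, here is a minimal sketch of the drop_duplicates behaviour described above. It assumes the data is held in a DataFrame named df (a name not given in the question) and uses only the first seven rows of the sample to keep it short.

```python
import pandas as pd

# First seven rows of the sample data; `df` is an assumed variable name.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2023-02-14 10:15:00", "2023-02-14 11:15:00", "2023-02-14 12:15:00",
        "2023-02-14 13:15:00", "2023-02-14 14:15:00", "2023-02-14 15:15:00",
        "2023-02-15 09:15:00",
    ]),
    "number": [195.35, 195.8, 195.87, 196.06, 195.97, 195.98, 196.23],
    "bool": [False, False, False, False, True, True, False],
})

# drop_duplicates compares values across the whole column, so only the
# first False row and the first True row survive.
deduped = df.drop_duplicates(subset="bool")
print(deduped)
```

Row 6 starts a new run of FALSE values and should be kept, but it is dropped because FALSE already appeared at row 0.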

Answer 1

Score: 1

You can use boolean indexing, comparing each value with the previous row's value via Series.ne and Series.shift:

out = df[df['bool'].ne(df['bool'].shift())]
print(out)
               datetime  number   bool
0   2023-02-14 10:15:00  195.35  False
4   2023-02-14 14:15:00  195.97   True
6   2023-02-15 09:15:00  196.23  False
8   2023-02-15 11:15:00  196.26   True
11  2023-02-15 14:15:00  197.14  False
13  2023-02-16 09:15:00  197.85   True
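
The one-liner above can be unpacked step by step. The following is a self-contained sketch that rebuilds the question's sample data; the names prev, is_run_start, run_id and alt are illustrative, and the groupby variant is an assumed alternative, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2023-02-14 10:15:00", "2023-02-14 11:15:00", "2023-02-14 12:15:00",
        "2023-02-14 13:15:00", "2023-02-14 14:15:00", "2023-02-14 15:15:00",
        "2023-02-15 09:15:00", "2023-02-15 10:15:00", "2023-02-15 11:15:00",
        "2023-02-15 12:15:00", "2023-02-15 13:15:00", "2023-02-15 14:15:00",
        "2023-02-15 15:15:00", "2023-02-16 09:15:00", "2023-02-16 10:15:00",
    ]),
    "number": [195.35, 195.8, 195.87, 196.06, 195.97, 195.98, 196.23, 196.3,
               196.26, 196.4, 196.28, 197.14, 197.08, 197.85, 198.01],
    "bool": [False, False, False, False, True, True, False, False,
             True, True, True, False, False, True, True],
})

# Previous row's value; the first row has no predecessor, so it gets NaN.
prev = df["bool"].shift()

# True wherever the value differs from the previous row. The first row
# compares against NaN, which is never equal, so it is always kept.
is_run_start = df["bool"].ne(prev)
out = df[is_run_start]

# Assumed alternative: number each run with cumsum, then take the first
# row of every run.
run_id = is_run_start.cumsum()
alt = df.groupby(run_id).head(1)

print(out)
```

Both out and alt keep the rows at index 0, 4, 6, 8, 11 and 13. The mask version is shorter; the run_id version may be handy if each run needs further aggregation.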

Answer 2

Score: 1

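What if you use DataFrame.duplicated (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html)?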

without_duplicates = df.duplicated(['datetime', 'number'], keep='last') & df['bool']
print(df[~without_duplicates])

Sample that I tried (screenshot): https://i.stack.imgur.com/qgIPy.png
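
As a quick check, this suggestion can be run against the question's data. The sketch below uses the first seven sample rows and assumes a DataFrame named df; the names mask and result are illustrative.

```python
import pandas as pd

# First seven rows of the question's sample; `df` is an assumed name.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2023-02-14 10:15:00", "2023-02-14 11:15:00", "2023-02-14 12:15:00",
        "2023-02-14 13:15:00", "2023-02-14 14:15:00", "2023-02-14 15:15:00",
        "2023-02-15 09:15:00",
    ]),
    "number": [195.35, 195.8, 195.87, 196.06, 195.97, 195.98, 196.23],
    "bool": [False, False, False, False, True, True, False],
})

# duplicated() flags rows whose (datetime, number) pair occurs again later.
# Every datetime in the sample is unique, so nothing is flagged.
mask = df.duplicated(["datetime", "number"], keep="last") & df["bool"]
result = df[~mask]

print(mask.any())   # whether any row was flagged for removal
print(len(result))  # number of rows kept
```

On this sample no row is flagged, so every row is kept. The approach appears to fit data where the same datetime and number pair is repeated, rather than runs of consecutive bool values.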

huangapple
  • Posted on 2023-04-19 17:30:05
  • Please keep this link when reposting: https://go.coder-hub.com/76052903.html