如何从 pandas 数据框中删除具有一些复杂条件的重复项?

huangapple go评论101阅读模式
英文:

How to drop duplicate from a pandas dataframe with some complex conditions?

问题

我正在尝试根据一些条件删除重复项。我的数据框如下所示:

  1. idx a b c d e f
  2. 1 1 ss1 0 25 A B
  3. 2 3 ss7 0 25 A B
  4. 3 5 ss5 0 12 C D
  5. 4 11 im3 0 12 C D
  6. 5 5 ss8 0 50 C K
  7. 6 9 im8 0 5 F G
  8. 7 8 ix6 0 5 F G

如果列 def 的值在数据框中的其他记录中匹配,则认为行是重复的,subset=['d', 'e', 'f']。例如,行1和行2是重复的,行3和行4是重复的,行6和行7是重复的。选择要删除的行基于列 b

  1. 如果列 b 中的值都以 ss 开头(例如行1和行2),则可以删除任何一个。

  2. 如果重复项中有一个以 ss 开头,而另一个以不同格式开头(例如行3和行4),则应保留以 ss 开头的那一个。

  3. 如果列 b 中的重复项都不以 ss 开头(例如行6和行7),则可以选择任何一个。

因此,预期输出应该类似于以下内容:

  1. idx a b c d e f
  2. 2 3 ss7 0 25 A B
  3. 3 5 ss5 0 12 C D
  4. 5 5 ss8 0 50 C K
  5. 7 8 ix6 0 5 F G
英文:

I am trying to drop duplicates, but based on some conditions. My dataframe looks like this:

  1. idx a b c d e f
  2. 1 1 ss1 0 25 A B
  3. 2 3 ss7 0 25 A B
  4. 3 5 ss5 0 12 C D
  5. 4 11 im3 0 12 C D
  6. 5 5 ss8 0 50 C K
  7. 6 9 im8 0 5 F G
  8. 7 8 ix6 0 5 F G

Rows are considered duplicates if the values of columns d, e and f together match other records in the dataframe subset=['d', 'e', 'f']. For example, rows 1 and 2 are duplicates, rows 3 and 4 are duplicates, and rows 6 and 7 are duplicates. The selection of which row to drop is based on column b.

  1. If the value in column b begins with ss for both duplicates (rows 1 and 2), then anyone can be dropped

  2. If one of the duplicates begins with ss and the other begins with a different format (rows 3 and 4), then the one that begins with ss should be kept.

  3. If both duplicates in column b begin with anything other than ss (rows 6 and 7), then anyone can be selected.

Therefore, the expected output should be something like this:

  1. idx a b c d e f
  2. 2 3 ss7 0 25 A B
  3. 3 5 ss5 0 12 C D
  4. 5 5 ss8 0 50 C K
  5. 7 8 ix6 0 5 F G

答案1

得分: 3

b 键首先排序(所有以 'ss' 开头的项目移到末尾),然后从 [‘d’, ‘e’, ‘f’] 中去除重复项(保留最后一个):

  1. out = (df.sort_values('b', key=lambda x: x.str.startswith('ss'))
  2. .drop_duplicates(['d', 'e', 'f'], keep='last').sort_index())

或者

  1. out = (df.sort_values('b', key=lambda x: x.str.startswith('ss'))
  2. .groupby(['d', 'e', 'f'], as_index=False).nth(-1).sort_index())

输出:

  1. >>> out
  2. idx a b c d e f
  3. 1 2 3 ss7 0 25 A B
  4. 2 3 5 ss5 0 12 C D
  5. 4 5 5 ss8 0 50 C K
  6. 6 7 8 ix6 0 5 F G
英文:

Sort by b key first (everything starts by 'ss' is moved to the end) then drop duplicates from ['d', 'e', 'f'] (keep the last):

  1. out = (df.sort_values('b', key=lambda x: x.str.startswith('ss'))
  2. .drop_duplicates(['d', 'e', 'f'], keep='last').sort_index())
  3. # OR
  4. out = (df.sort_values('b', key=lambda x: x.str.startswith('ss'))
  5. .groupby(['d', 'e', 'f'], as_index=False).nth(-1).sort_index())

Output:

  1. >>> out
  2. idx a b c d e f
  3. 1 2 3 ss7 0 25 A B
  4. 2 3 5 ss5 0 12 C D
  5. 4 5 5 ss8 0 50 C K
  6. 6 7 8 ix6 0 5 F G

huangapple
  • 本文由 发表于 2023年3月31日 17:19:44
  • 转载请务必保留本文链接:https://go.coder-hub.com/75896790.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定