How to drop duplicated rows based on pattern change in dataframe?

Question

Imagine I have a dataframe like the one below, with percentages by class on specific dates for each unique ID:

    import pandas as pd

    df = pd.DataFrame({
        "ID": ["A", "A", "A", "A", "A", "A", "A"],
        "DATE": ["01-1990", "03-1990", "04-1990", "05-1990", "06-1990", "07-1990", "08-1990"],
        "CLASS A": [30, 30, 0, 0, 0, 30, 30],
        "CLASS B": [50, 50, 50, 50, 50, 50, 50],
        "CLASS C": [20, 20, 50, 50, 50, 20, 20],
    })
    df

    Out[4]:
      ID     DATE  CLASS A  CLASS B  CLASS C
    0  A  01-1990       30       50       20
    1  A  03-1990       30       50       20
    2  A  04-1990        0       50       50
    3  A  05-1990        0       50       50
    4  A  06-1990        0       50       50
    5  A  07-1990       30       50       20
    6  A  08-1990       30       50       20

I would like to drop duplicated rows based on ID, CLASS A, CLASS B and CLASS C, keeping the first one, but only within each consecutive run of the same percentage pattern. In this example, the pattern changes twice (30/50/20 to 0/50/50, and then back to 30/50/20). The result should look like this:

      ID     DATE  CLASS A  CLASS B  CLASS C
    0  A  01-1990       30       50       20
    2  A  04-1990        0       50       50
    5  A  07-1990       30       50       20

I know how to remove duplicated rows across the whole dataframe (df.drop_duplicates), but I can't use it directly here, because it would also remove the rows at index 5 and 6. Could anyone help me?
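
To make the problem concrete, here is a minimal sketch of what a plain drop_duplicates does on the sample data above. The names `cols` and `deduped` are only illustrative and are not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["A", "A", "A", "A", "A", "A", "A"],
    "DATE": ["01-1990", "03-1990", "04-1990", "05-1990", "06-1990", "07-1990", "08-1990"],
    "CLASS A": [30, 30, 0, 0, 0, 30, 30],
    "CLASS B": [50, 50, 50, 50, 50, 50, 50],
    "CLASS C": [20, 20, 50, 50, 50, 20, 20],
})

# Columns that define a "pattern" (illustrative name)
cols = ["ID", "CLASS A", "CLASS B", "CLASS C"]

# Global de-duplication: keeps only the first occurrence of each pattern,
# no matter where later occurrences appear in the frame
deduped = df.drop_duplicates(subset=cols, keep="first")
print(deduped.index.tolist())  # [0, 2]
```

Index 5 is dropped because its pattern already appeared at index 0, even though a different pattern sits in between. What is needed instead is a comparison with the immediately preceding row only.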

Answer 1

Score: 3

In your case, I wouldn't use drop_duplicates. Instead, I would compare each row with the previous one using the shift method and keep only the rows that differ.

Something like:

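    # Only these columns decide whether two rows count as the same pattern
    compare_df = df[["ID", "CLASS A", "CLASS B", "CLASS C"]]
    # True where every compared column equals the row directly above
    row_is_like_previous_one = (compare_df == compare_df.shift(1)).all(axis=1)
    # Keep the first row of each consecutive run
    result = df[~row_is_like_previous_one]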
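
For reference, here is a self-contained sketch of the same idea run against the sample data from the question. It uses `ne(...).any(axis=1)`, which is the logical complement of the `==` / `all` form above; the names `key_cols` and `changed` are illustrative. It assumes the rows are already sorted by ID and date, and that the compared columns contain no missing values (NaN never compares equal to NaN, so such rows would always be kept).

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["A", "A", "A", "A", "A", "A", "A"],
    "DATE": ["01-1990", "03-1990", "04-1990", "05-1990", "06-1990", "07-1990", "08-1990"],
    "CLASS A": [30, 30, 0, 0, 0, 30, 30],
    "CLASS B": [50, 50, 50, 50, 50, 50, 50],
    "CLASS C": [20, 20, 50, 50, 50, 20, 20],
})

# Columns that define a pattern (illustrative name)
key_cols = ["ID", "CLASS A", "CLASS B", "CLASS C"]
compare_df = df[key_cols]

# True where at least one key column differs from the row directly above.
# The first row is compared against NaN, so it is always marked as changed.
changed = compare_df.ne(compare_df.shift()).any(axis=1)

result = df[changed]
print(result)
#   ID     DATE  CLASS A  CLASS B  CLASS C
# 0  A  01-1990       30       50       20
# 2  A  04-1990        0       50       50
# 5  A  07-1990       30       50       20
```

Because ID is one of the compared columns, a change of ID also starts a new run, so the first row of each ID is kept as long as the rows of one ID are contiguous.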

huangapple
  • Posted on June 26, 2023 at 05:25:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76552469.html