How to drop duplicated rows based on pattern change in dataframe?
Question
Imagine I have a dataframe like the one below, with percentages by class on specific dates for each unique ID:
import pandas as pd
df = pd.DataFrame({
    "ID": ["A", "A", "A", "A", "A", "A", "A"],
    "DATE": ["01-1990", "03-1990", "04-1990", "05-1990", "06-1990", "07-1990", "08-1990"],
    "CLASS A": [30, 30, 0, 0, 0, 30, 30],
    "CLASS B": [50, 50, 50, 50, 50, 50, 50],
    "CLASS C": [20, 20, 50, 50, 50, 20, 20],
})
df
Out[4]:
  ID     DATE  CLASS A  CLASS B  CLASS C
0  A  01-1990       30       50       20
1  A  03-1990       30       50       20
2  A  04-1990        0       50       50
3  A  05-1990        0       50       50
4  A  06-1990        0       50       50
5  A  07-1990       30       50       20
6  A  08-1990       30       50       20
I would like to drop duplicated rows based on ID, CLASS A, CLASS B and CLASS C, keeping the first one, but only within each consecutive run, that is, until the percentages change to another pattern. In this example, the pattern changes twice (30/50/20 to 0/50/50, and then back to 30/50/20). The result should look like this:
  ID     DATE  CLASS A  CLASS B  CLASS C
0  A  01-1990       30       50       20
2  A  04-1990        0       50       50
5  A  07-1990       30       50       20
I know how to remove duplicated rows based on the whole dataframe (df.drop_duplicates), but I can't do that directly here, because it would remove the rows at index 5 and 6 as well. Could anyone help me?
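To make the problem concrete, here is a minimal sketch of what a plain drop_duplicates call does on this data. It assumes the comparison is restricted to the four pattern columns through the subset parameter; the variable names pattern_cols and deduped are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["A", "A", "A", "A", "A", "A", "A"],
    "DATE": ["01-1990", "03-1990", "04-1990", "05-1990", "06-1990", "07-1990", "08-1990"],
    "CLASS A": [30, 30, 0, 0, 0, 30, 30],
    "CLASS B": [50, 50, 50, 50, 50, 50, 50],
    "CLASS C": [20, 20, 50, 50, 50, 20, 20],
})

# Columns that define a "pattern" (illustrative name, not from the question)
pattern_cols = ["ID", "CLASS A", "CLASS B", "CLASS C"]

# drop_duplicates looks at the whole frame, not only at neighbouring rows,
# so the second 30/50/20 run (index 5 and 6) is treated as a duplicate of index 0.
deduped = df.drop_duplicates(subset=pattern_cols, keep="first")

print(list(deduped.index))  # [0, 2]
```

Only index 0 and 2 survive, so the return to 30/50/20 at index 5 is lost, which is exactly the row that should be kept.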
Answer 1
Score: 3
In your case, I wouldn't use drop_duplicates. Instead, I would use the shift method to compare each row with the previous one and keep only the rows that differ.
Something like:
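# Compare only the columns that define a pattern
compare_df = df[["ID", "CLASS A", "CLASS B", "CLASS C"]]
# True where a row equals the row directly above it (the first row never matches)
row_is_like_previous_one = (compare_df == compare_df.shift(1)).all(axis=1)
# Keep only the rows that start a new run
result = df[~row_is_like_previous_one]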
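For reference, the same idea can be written as a self-contained script. This is only a sketch under a few assumptions: the helper name drop_consecutive_duplicates is made up for illustration, the rows are assumed to be already sorted in the order that matters (here by date), and the compared columns are assumed to contain no missing values, since NaN never compares equal to NaN.

```python
import pandas as pd


def drop_consecutive_duplicates(frame, cols):
    """Keep the first row of every consecutive run of identical values in `cols`.

    Hypothetical helper, not part of pandas.
    """
    subset = frame[cols]
    # A row starts a new run when at least one column differs from the row above.
    # The first row is compared against NaN, so it always counts as a new run.
    starts_new_run = subset.ne(subset.shift(1)).any(axis=1)
    return frame[starts_new_run]


df = pd.DataFrame({
    "ID": ["A", "A", "A", "A", "A", "A", "A"],
    "DATE": ["01-1990", "03-1990", "04-1990", "05-1990", "06-1990", "07-1990", "08-1990"],
    "CLASS A": [30, 30, 0, 0, 0, 30, 30],
    "CLASS B": [50, 50, 50, 50, 50, 50, 50],
    "CLASS C": [20, 20, 50, 50, 50, 20, 20],
})

result = drop_consecutive_duplicates(df, ["ID", "CLASS A", "CLASS B", "CLASS C"])
print(list(result.index))  # [0, 2, 5]
```

Because ID is one of the compared columns, a change of ID also starts a new run, so rows belonging to different IDs are not merged as long as each ID's rows are contiguous.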