英文:
How to filter out pandas DF, and create 3 new DFs based on those conditions?
问题
- 删除不同来源的重复记录。如果同一ID有多个记录,日期相同,只保留来源为B的记录。创建新的DF,如下所示:
name;id;value;source;date
john;id_123;33;B;2023-03-29
peter;id_222;44;B;2023-03-30
mary;id_333;88;A;2023-30-30
- 以John为例:删除不同来源的重复记录。如果同一ID、相同数值和相同日期,但来源不同,则删除来源A,只保留来源B的记录。然后创建一个新的DF,包含所有原始记录,减去"类似于John"的记录(带有来源A的记录)。新DF如下所示:
name;id;value;source;date
john;id_123;33;B;2023-03-29
peter;id_222;55;A;2023-03-30
peter;id_222;44;B;2023-03-30
mary;id_333;88;A;2023-30-30
- 以Peter为例:查找所有符合以下条件的记录:相同ID、相同日期、不同数值。保留这些记录并创建一个新的DF。新DF如下所示:
name;id;value;source;date
peter;id_222;55;A;2023-03-30
peter;id_222;44;B;2023-03-30
英文:
So I have a large DF with the following structure:
name;id;value;source;date
john;id_123;33;A;2023-03-29
john;id_123;33;B;2023-03-29
peter;id_222;55;A;2023-03-30
peter;id_222;44;B;2023-03-30
mary;id_333;88;A;2023-30-30
I would like to filter out some results, and create 3 new datasets.
1.Remove duplicates, from different sources. If there is a multiple record for the same ID, with the same date, leave only the record of the source B. Create new DF based on this filtering. New DF should look like this
name;id;value;source;date
john;id_123;33;B;2023-03-29
peter;id_222;44;B;2023-03-30
mary;id_333;88;A;2023-30-30
2.Example of John: Remove duplicates, from different sources. If there is a same ID, with same value and same date, but different source, I would like to drop source A, and just leave the row with source B. Then create new DF that would contain all original records, minus "john like" records with source A. New DF should look like this:
name;id;value;source;date
john;id_123;33;B;2023-03-29
peter;id_222;55;A;2023-03-30
peter;id_222;44;B;2023-03-30
mary;id_333;88;A;2023-30-30
3.Example of Peter: find all records with following conditions: same ID, same date, different values. Keep both records. Find all of those cases and create a new DF from it.
New DF should look like this:
name;id;value;source;date
peter;id_222;55;A;2023-03-30
peter;id_222;44;B;2023-03-30
答案1
得分: 1
1
如果同一个ID有多个相同日期的记录,只保留来源为B的记录。
cond1 = df.duplicated(['id', 'date'], keep=False)
cond2 = df['source'].ne('B')
df[~(cond1 & cond2)]
输出:
name id value source date
1 john id_123 33 B 2023-03-29
3 peter id_222 44 B 2023-03-30
4 mary id_333 88 A 2023-30-30
2
如果有相同的ID、相同的值和相同的日期,但来源不同,只保留来源为B的行。
cond3 = df.duplicated(['id', 'value', 'date'], keep=False)
df[~(cond3 & cond2)]
输出:
name id value source date
1 john id_123 33 B 2023-03-29
2 peter id_222 55 A 2023-03-30
3 peter id_222 44 B 2023-03-30
4 mary id_333 88 A 2023-30-30
3
相同的ID、相同的日期、不同的值。保留这两条记录。
df[cond1 & ~cond3]
输出:
name id value source date
2 peter id_222 55 A 2023-03-30
3 peter id_222 44 B 2023-03-30
英文:
1
If there is a multiple record for the same ID, with the same date, leave only the record of the source B
cond1 = df.duplicated(['id', 'date'], keep=False)
cond2 = df['source'].ne('B')
df[~(cond1 & cond2)]
output:
name id value source date
1 john id_123 33 B 2023-03-29
3 peter id_222 44 B 2023-03-30
4 mary id_333 88 A 2023-30-30
2
If there is a same ID, with same value and same date, but different source, just leave the row with source B.
cond3 = df.duplicated(['id', 'value', 'date'], keep=False)
df[~(cond3 & cond2)]
output
name id value source date
1 john id_123 33 B 2023-03-29
2 peter id_222 55 A 2023-03-30
3 peter id_222 44 B 2023-03-30
4 mary id_333 88 A 2023-30-30
3
same ID, same date, different values. Keep both records.
df[cond1 & ~cond3]
output:
name id value source date
2 peter id_222 55 A 2023-03-30
3 peter id_222 44 B 2023-03-30
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论