如何过滤 pandas 数据框(DF)并根据这些条件创建三个新的数据框(DF)?

huangapple go评论67阅读模式
英文:

How to filter out pandas DF, and create 3 new DFs based on those conditions?

问题

  1. 删除不同来源的重复记录。如果同一ID有多个记录,日期相同,只保留来源为B的记录。创建新的DF,如下所示:
name;id;value;source;date
john;id_123;33;B;2023-03-29
peter;id_222;44;B;2023-03-30
mary;id_333;88;A;2023-30-30
  1. 以John为例:删除不同来源的重复记录。如果同一ID、相同数值和相同日期,但来源不同,则删除来源A,只保留来源B的记录。然后创建一个新的DF,包含所有原始记录,减去"类似于John"的记录(带有来源A的记录)。新DF如下所示:
name;id;value;source;date
john;id_123;33;B;2023-03-29
peter;id_222;55;A;2023-03-30
peter;id_222;44;B;2023-03-30
mary;id_333;88;A;2023-30-30
  1. 以Peter为例:查找所有符合以下条件的记录:相同ID、相同日期、不同数值。保留这些记录并创建一个新的DF。新DF如下所示:
name;id;value;source;date
peter;id_222;55;A;2023-03-30
peter;id_222;44;B;2023-03-30
英文:

So I have a large DF with the following structure:

name;id;value;source;date
john;id_123;33;A;2023-03-29
john;id_123;33;B;2023-03-29
peter;id_222;55;A;2023-03-30
peter;id_222;44;B;2023-03-30
mary;id_333;88;A;2023-30-30

I would like to filter out some results, and create 3 new datasets.

1.Remove duplicates, from different sources. If there is a multiple record for the same ID, with the same date, leave only the record of the source B. Create new DF based on this filtering. New DF should look like this

 name;id;value;source;date
 john;id_123;33;B;2023-03-29
 peter;id_222;44;B;2023-03-30
 mary;id_333;88;A;2023-30-30

2.Example of John: Remove duplicates, from different sources. If there is a same ID, with same value and same date, but different source, I would like to drop source A, and just leave the row with source B. Then create new DF that would contain all original records, minus "john like" records with source A. New DF should look like this:

name;id;value;source;date
john;id_123;33;B;2023-03-29
peter;id_222;55;A;2023-03-30
peter;id_222;44;B;2023-03-30
mary;id_333;88;A;2023-30-30

3.Example of Peter: find all records with following conditions: same ID, same date, different values. Keep both records. Find all of those cases and create a new DF from it.
New DF should look like this:

name;id;value;source;date
peter;id_222;55;A;2023-03-30
peter;id_222;44;B;2023-03-30 

答案1

得分: 1

1

如果同一个ID有多个相同日期的记录,只保留来源为B的记录。

cond1 = df.duplicated(['id', 'date'], keep=False)
cond2 = df['source'].ne('B')
df[~(cond1 & cond2)]

输出:

    name    id      value   source  date
1   john    id_123  33      B       2023-03-29
3   peter   id_222  44      B       2023-03-30
4   mary    id_333  88      A       2023-30-30

2

如果有相同的ID、相同的值和相同的日期,但来源不同,只保留来源为B的行。

cond3 = df.duplicated(['id', 'value', 'date'], keep=False)
df[~(cond3 & cond2)]

输出:

    name    id      value   source  date
1   john    id_123  33      B       2023-03-29
2   peter   id_222  55      A       2023-03-30
3   peter   id_222  44      B       2023-03-30
4   mary    id_333  88      A       2023-30-30

3

相同的ID、相同的日期、不同的值。保留这两条记录。

df[cond1 & ~cond3]

输出:

    name    id      value   source  date
2   peter   id_222  55      A       2023-03-30
3   peter   id_222  44      B       2023-03-30
英文:

1

If there is a multiple record for the same ID, with the same date, leave only the record of the source B

cond1 = df.duplicated(['id', 'date'], keep=False)
cond2 = df['source'].ne('B')
df[~(cond1 & cond2)]

output:

    name	id	    value	source	date
1	john	id_123	33	    B	    2023-03-29
3	peter	id_222	44	    B	    2023-03-30
4	mary	id_333	88	    A	    2023-30-30

2

If there is a same ID, with same value and same date, but different source, just leave the row with source B.

cond3 = df.duplicated(['id', 'value', 'date'], keep=False)
df[~(cond3 & cond2)]

output

    name	id	    value	source	date
1	john	id_123	33	    B	    2023-03-29
2	peter	id_222	55	    A	    2023-03-30
3	peter	id_222	44	    B	    2023-03-30
4	mary	id_333	88	    A	    2023-30-30

3

same ID, same date, different values. Keep both records.

df[cond1 & ~cond3]

output:

    name	id	    value	source	date
2	peter	id_222	55	    A	    2023-03-30
3	peter	id_222	44    	B	    2023-03-30

huangapple
  • 本文由 发表于 2023年5月6日 23:29:32
  • 转载请务必保留本文链接:https://go.coder-hub.com/76189707.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定