如何过滤 pandas 数据框(DF)并根据这些条件创建三个新的数据框(DF)?

huangapple go评论97阅读模式
英文:

How to filter out pandas DF, and create 3 new DFs based on those conditions?

问题

  1. 删除不同来源的重复记录。如果同一ID有多个记录,日期相同,只保留来源为B的记录。创建新的DF,如下所示:
  1. name;id;value;source;date
  2. john;id_123;33;B;2023-03-29
  3. peter;id_222;44;B;2023-03-30
  4. mary;id_333;88;A;2023-30-30
  1. 以John为例:删除不同来源的重复记录。如果同一ID、相同数值和相同日期,但来源不同,则删除来源A,只保留来源B的记录。然后创建一个新的DF,包含所有原始记录,减去"类似于John"的记录(带有来源A的记录)。新DF如下所示:
  1. name;id;value;source;date
  2. john;id_123;33;B;2023-03-29
  3. peter;id_222;55;A;2023-03-30
  4. peter;id_222;44;B;2023-03-30
  5. mary;id_333;88;A;2023-30-30
  1. 以Peter为例:查找所有符合以下条件的记录:相同ID、相同日期、不同数值。保留这些记录并创建一个新的DF。新DF如下所示:
  1. name;id;value;source;date
  2. peter;id_222;55;A;2023-03-30
  3. peter;id_222;44;B;2023-03-30
英文:

So I have a large DF with the following structure:

  1. name;id;value;source;date
  2. john;id_123;33;A;2023-03-29
  3. john;id_123;33;B;2023-03-29
  4. peter;id_222;55;A;2023-03-30
  5. peter;id_222;44;B;2023-03-30
  6. mary;id_333;88;A;2023-30-30

I would like to filter out some results, and create 3 new datasets.

1.Remove duplicates, from different sources. If there is a multiple record for the same ID, with the same date, leave only the record of the source B. Create new DF based on this filtering. New DF should look like this

  1. name;id;value;source;date
  2. john;id_123;33;B;2023-03-29
  3. peter;id_222;44;B;2023-03-30
  4. mary;id_333;88;A;2023-30-30

2.Example of John: Remove duplicates, from different sources. If there is a same ID, with same value and same date, but different source, I would like to drop source A, and just leave the row with source B. Then create new DF that would contain all original records, minus "john like" records with source A. New DF should look like this:

  1. name;id;value;source;date
  2. john;id_123;33;B;2023-03-29
  3. peter;id_222;55;A;2023-03-30
  4. peter;id_222;44;B;2023-03-30
  5. mary;id_333;88;A;2023-30-30

3.Example of Peter: find all records with following conditions: same ID, same date, different values. Keep both records. Find all of those cases and create a new DF from it.
New DF should look like this:

  1. name;id;value;source;date
  2. peter;id_222;55;A;2023-03-30
  3. peter;id_222;44;B;2023-03-30

答案1

得分: 1

1

如果同一个ID有多个相同日期的记录,只保留来源为B的记录。

  1. cond1 = df.duplicated(['id', 'date'], keep=False)
  2. cond2 = df['source'].ne('B')
  3. df[~(cond1 & cond2)]

输出:

  1. name id value source date
  2. 1 john id_123 33 B 2023-03-29
  3. 3 peter id_222 44 B 2023-03-30
  4. 4 mary id_333 88 A 2023-30-30

2

如果有相同的ID、相同的值和相同的日期,但来源不同,只保留来源为B的行。

  1. cond3 = df.duplicated(['id', 'value', 'date'], keep=False)
  2. df[~(cond3 & cond2)]

输出:

  1. name id value source date
  2. 1 john id_123 33 B 2023-03-29
  3. 2 peter id_222 55 A 2023-03-30
  4. 3 peter id_222 44 B 2023-03-30
  5. 4 mary id_333 88 A 2023-30-30

3

相同的ID、相同的日期、不同的值。保留这两条记录。

  1. df[cond1 & ~cond3]

输出:

  1. name id value source date
  2. 2 peter id_222 55 A 2023-03-30
  3. 3 peter id_222 44 B 2023-03-30
英文:

1

If there is a multiple record for the same ID, with the same date, leave only the record of the source B

  1. cond1 = df.duplicated(['id', 'date'], keep=False)
  2. cond2 = df['source'].ne('B')
  3. df[~(cond1 & cond2)]

output:

  1. name id value source date
  2. 1 john id_123 33 B 2023-03-29
  3. 3 peter id_222 44 B 2023-03-30
  4. 4 mary id_333 88 A 2023-30-30

2

If there is a same ID, with same value and same date, but different source, just leave the row with source B.

  1. cond3 = df.duplicated(['id', 'value', 'date'], keep=False)
  2. df[~(cond3 & cond2)]

output

  1. name id value source date
  2. 1 john id_123 33 B 2023-03-29
  3. 2 peter id_222 55 A 2023-03-30
  4. 3 peter id_222 44 B 2023-03-30
  5. 4 mary id_333 88 A 2023-30-30

3

same ID, same date, different values. Keep both records.

  1. df[cond1 & ~cond3]

output:

  1. name id value source date
  2. 2 peter id_222 55 A 2023-03-30
  3. 3 peter id_222 44 B 2023-03-30

huangapple
  • 本文由 发表于 2023年5月6日 23:29:32
  • 转载请务必保留本文链接:https://go.coder-hub.com/76189707.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定