2023-05-06 23:29:32

How to filter a pandas DataFrame and create 3 new DataFrames based on different conditions?

Question

I have a large DataFrame (DF) with the following structure:

name;id;value;source;date
john;id_123;33;A;2023-03-29
john;id_123;33;B;2023-03-29
peter;id_222;55;A;2023-03-30
peter;id_222;44;B;2023-03-30
mary;id_333;88;A;2023-30-30
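
To experiment with the sample above, it can be loaded into a DataFrame as follows. This is a minimal sketch: the variable names `raw` and `df` are assumptions, and the `date` column is deliberately read as plain text because the value `2023-30-30` in the sample is not a valid calendar date.

```python
import io

import pandas as pd

# The sample rows from the question, kept exactly as posted.
raw = """name;id;value;source;date
john;id_123;33;A;2023-03-29
john;id_123;33;B;2023-03-29
peter;id_222;55;A;2023-03-30
peter;id_222;44;B;2023-03-30
mary;id_333;88;A;2023-30-30
"""

# Fields are separated by semicolons; dates stay as strings so that the
# invalid "2023-30-30" entry is preserved rather than parsed.
df = pd.read_csv(io.StringIO(raw), sep=";", dtype={"date": str})
print(df)
```

For a real file, the same `sep=";"` argument should work with a file path in place of the `io.StringIO` object.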

I would like to filter it in three different ways and create a new DataFrame from each.

1. Remove duplicates that come from different sources. If there are multiple records for the same ID with the same date, keep only the record from source B. Create a new DF based on this filtering. The new DF should look like this:

name;id;value;source;date
john;id_123;33;B;2023-03-29
peter;id_222;44;B;2023-03-30
mary;id_333;88;A;2023-30-30

2. Example of John: remove duplicates that come from different sources. If records share the same ID, the same value, and the same date but have different sources, drop the source A row and keep only the source B row. Then create a new DF that contains all original records minus the "John-like" records with source A. The new DF should look like this:

name;id;value;source;date
john;id_123;33;B;2023-03-29
peter;id_222;55;A;2023-03-30
peter;id_222;44;B;2023-03-30
mary;id_333;88;A;2023-30-30

3. Example of Peter: find all records with the same ID and the same date but different values. Keep both records, and create a new DF from all such cases. The new DF should look like this:

name;id;value;source;date
peter;id_222;55;A;2023-03-30
peter;id_222;44;B;2023-03-30

Answer 1

Score: 1

If there are multiple records for the same ID with the same date, keep only the record from source B:

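# Rows whose (id, date) pair occurs more than once
cond1 = df.duplicated(['id', 'date'], keep=False)
# Rows that do not come from source B
cond2 = df['source'].ne('B')
# Drop duplicated rows that are not from source B
df[~(cond1 & cond2)]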

Output:

    name    id      value   source  date
1   john    id_123  33      B       2023-03-29
3   peter   id_222  44      B       2023-03-30
4   mary    id_333  88      A       2023-30-30
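
One thing worth checking on real data: the mask `~(cond1 & cond2)` drops every non-B row in a duplicated (id, date) group, so a group that is duplicated but has no source B row at all would disappear entirely. If that is not wanted, a possible variant is sketched below. The frame here is made up for illustration (the `id_444` rows and source `C` are not in the question), and the names `is_b` and `group_has_b` are assumptions.

```python
import pandas as pd

# Hypothetical data: id_444 is duplicated on (id, date) but has no source B row.
df = pd.DataFrame({
    "name": ["john", "john", "ann", "ann"],
    "id": ["id_123", "id_123", "id_444", "id_444"],
    "value": [33, 33, 10, 20],
    "source": ["A", "B", "A", "C"],
    "date": ["2023-03-29", "2023-03-29", "2023-04-01", "2023-04-01"],
})

is_b = df["source"].eq("B")
# True for every row whose (id, date) group contains at least one B row.
group_has_b = is_b.groupby([df["id"], df["date"]]).transform("any")

# Drop a row only when it is non-B and a B row exists for the same id and date.
result = df[~(group_has_b & ~is_b)]
print(result)
```

With this variant the two `id_444` rows are kept, whereas the original mask would remove both of them.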

If records share the same ID, the same value, and the same date but have different sources, keep only the row from source B:

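# Rows whose (id, value, date) triple occurs more than once
cond3 = df.duplicated(['id', 'value', 'date'], keep=False)
# Drop those duplicates when they are not from source B (cond2 is defined above)
df[~(cond3 & cond2)]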

Output:

    name    id      value   source  date
1   john    id_123  33      B       2023-03-29
2   peter   id_222  55      A       2023-03-30
3   peter   id_222  44      B       2023-03-30
4   mary    id_333  88      A       2023-30-30

Same ID, same date, different values: keep both records.

# Duplicated on (id, date) but not on (id, value, date)
df[cond1 & ~cond3]

Output:

    name    id      value   source  date
2   peter   id_222  55      A       2023-03-30
3   peter   id_222  44      B       2023-03-30
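
Putting the three filters together, the following is a self-contained sketch that builds the sample frame from the question and assigns each result to its own DataFrame. The names `df1`, `df2`, `df3` and the descriptive mask names are assumptions; the `.copy()` calls are optional and are only there so that each result is independent of the original frame.

```python
import pandas as pd

# Sample data from the question (dates kept as the strings that were posted).
df = pd.DataFrame({
    "name": ["john", "john", "peter", "peter", "mary"],
    "id": ["id_123", "id_123", "id_222", "id_222", "id_333"],
    "value": [33, 33, 55, 44, 88],
    "source": ["A", "B", "A", "B", "A"],
    "date": ["2023-03-29", "2023-03-29", "2023-03-30", "2023-03-30", "2023-30-30"],
})

# keep=False marks every member of a duplicated group, not just the repeats.
same_id_date = df.duplicated(["id", "date"], keep=False)
same_id_value_date = df.duplicated(["id", "value", "date"], keep=False)
not_b = df["source"].ne("B")

# 1. Among rows sharing id and date, keep only source B.
df1 = df[~(same_id_date & not_b)].copy()
# 2. Among rows sharing id, value and date, keep only source B.
df2 = df[~(same_id_value_date & not_b)].copy()
# 3. Rows sharing id and date whose (id, value, date) is not repeated.
df3 = df[same_id_date & ~same_id_value_date].copy()

print(df1, df2, df3, sep="\n\n")
```

The original row labels are preserved in each result; calling `reset_index(drop=True)` on a result would renumber them from zero if that is preferred.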
