April 4, 2023 16:13:12

Pandas: How to improve performance when comparing rows inside groups

Question

I have written a Python program to compare rows inside groups, but its performance is poor. The data comes from a Change Data Capture system. For every change, there is a sequence id and an operation number. An update operation produces two rows: one with Operation=3 (previous value) and one with Operation=4 (new value). Columns with no changes are set to null, but a value can also change from "Somevalue" to NULL, so I need to compare rows 3 and 4 to know whether a null means the value really is null or that there was no change.
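To make the ambiguity concrete, here is a minimal sketch of the comparison for a single update. The `before` and `after` Series are hypothetical stand-ins for one Operation=3 / Operation=4 pair, and treating "null in both rows" as "no change" is an assumption taken from the description above, not something the CDC system guarantees.

```python
import numpy as np
import pandas as pd

# Hypothetical before (Operation=3) and after (Operation=4) images of one update.
before = pd.Series({"IsCovidPositiv": "No", "Status": np.nan}, dtype=object)
after = pd.Series({"IsCovidPositiv": np.nan, "Status": np.nan}, dtype=object)

# NaN != NaN evaluates to True, so "null in both rows" has to be handled
# explicitly; here it is assumed to mean "no change".
both_null = before.isna() & after.isna()
changed = before.ne(after) & ~both_null

print(changed)
```

With this mask, `IsCovidPositiv` is flagged as changed (it went from "No" to null), while `Status` is not (it is null in both rows).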

This is an example of the source data:

Source data

This is the required output:

Desired outcome

Below is my code, using the same mock-up data:

import pandas as pd
import numpy as np
d = {'_Change-Sequence': [1, 1, 2, 2, 3, 3],
     '_Operation': [3, 4, 3, 4, 3, 4],
     'Dossier_x': [1, 1, 2, 2, 3, 3],
     'IsCovidPositiv': ['Yes', 'No', 'No', np.nan, 'Yes', 'Yes'],
     'Status': [np.nan, 'KO', np.nan, np.nan, np.nan, np.nan]
     }
df_update = pd.DataFrame(data=d)
print(df_update)
# Process every column that is neither metadata ("_" prefix) nor a key
for column in [column for column in df_update.columns if column not in {'index', 'Dossier_x'} if not column.startswith('_')]:
    column_previous_name = column + "_Previous|"
    # Value of the same column in the previous row of the same change sequence
    df_update[column_previous_name] = df_update.groupby('_Change-Sequence')[column].shift()
    # Keep the value when it differs from the previous row, otherwise NaN
    df_update[column] = df_update.apply(lambda x: x[column] if x[column_previous_name] != x[column] else np.nan, axis=1)
    df_update.drop(column_previous_name, axis=1, inplace=True)
# Keep only the "new value" rows
df_update = df_update[df_update['_Operation'] == 4]
df_update

Online version of the code

The output is as required: only one row per group (same Change Sequence), where each non-meta, non-PK column (that is, every column except "index", "Dossier_x", and those starting with "_") holds its value if it changed and NaN if it did not. I need to do this for every column (I don't know the column names in advance).

Regards

Vincent

The program in the question works, but its performance is poor.

Answer 1

Score: 0

If I understood your logic correctly, you could simplify your code to:

cols = [column for column in df_update.columns if column not in {'index', 'Dossier_x'}
        if not column.startswith('_')]
# get shifted values
tmp = df_update.groupby('_Change-Sequence')[cols].shift()
# mask equal values and slice
out = df_update.mask(df_update.eq(tmp, axis=0)).loc[df_update['_Operation'].eq(4)]

Output:

   _Change-Sequence  _Operation  Dossier_x IsCovidPositiv Status
1                 1           4          1             No     KO
3                 2           4          2            NaN    NaN
5                 3           4          3            NaN    NaN
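For convenience, the same idea can be written as a self-contained script that rebuilds the mock-up data from the question. This is only a sketch: the helper name `changed_values` and its parameters are hypothetical, and it assumes that within each change sequence the Operation=3 row comes immediately before the Operation=4 row, as in the sample data.

```python
import numpy as np
import pandas as pd

# Mock-up data from the question.
df_update = pd.DataFrame({
    "_Change-Sequence": [1, 1, 2, 2, 3, 3],
    "_Operation": [3, 4, 3, 4, 3, 4],
    "Dossier_x": [1, 1, 2, 2, 3, 3],
    "IsCovidPositiv": ["Yes", "No", "No", np.nan, "Yes", "Yes"],
    "Status": [np.nan, "KO", np.nan, np.nan, np.nan, np.nan],
})


def changed_values(df, group_col="_Change-Sequence", op_col="_Operation",
                   keys=("index", "Dossier_x")):
    """Hypothetical helper: blank out unchanged values, keep Operation=4 rows."""
    # Columns to compare: neither key columns nor "_"-prefixed metadata.
    cols = [c for c in df.columns if c not in keys and not c.startswith("_")]
    # Previous row of each group, computed for all columns in one call.
    previous = df.groupby(group_col)[cols].shift()
    out = df.copy()
    # Replace values equal to the previous row with NaN (NaN never equals NaN,
    # so nulls are left untouched).
    out[cols] = df[cols].mask(df[cols].eq(previous))
    return out.loc[out[op_col].eq(4)]


out = changed_values(df_update)
print(out)
```

Unlike the loop in the question, this avoids `apply(..., axis=1)`, which calls a Python function once per row, and performs a single grouped shift and a single element-wise comparison for all columns.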
