Pandas: 如何提高性能,比较组内的行

huangapple go评论107阅读模式
英文:

Pandas :How to improve performance, comparing rows inside groups

问题

I have done a python program to compare rows inside groups.But the performances are poor. The data are coming from a Change Data Capture system. For every change, there is a Sequence id , and an Operation number. For an Update operation, there is two rows: One with Operation=3 (previous value ) and one with Operation=4 (new value). The columns with no changes are set to null but a value can change from "Somevalue" to NULL so i need to compare row 3 and 4 to know when it's a Null because the value is really Null or because there is no change.

This is an example of the source data :

Source data

This is the output required :

Desired outcome

Bellow my code with the same mockup data :

  1. import pandas as pd
  2. import numpy as np
  3. d={'_Change-Sequence':[1,1,2,2,3,3],
  4. '_Operation':[3,4,3,4,3,4],
  5. 'Dossier_x':[1,1,2,2,3,3],
  6. 'IsCovidPositiv':['Yes','No','No',np.NaN,'Yes','Yes'],
  7. 'Status':[np.NaN,'KO',np.NaN,np.NaN,np.NaN,np.NaN]
  8. }
  9. df_update=pd.DataFrame(data=d)
  10. print(df_update)
  11. for column in [column for column in df_update.columns if column not in {'index','Dossier_x'} if not column.startswith('_')]:
  12. column_previous_name=column+"_Previous|"
  13. df_update[column_previous_name]=df_update.groupby('_Change-Sequence')[column].shift()
  14. df_update[column]=df_update.apply(lambda x:x[column] if x[column_previous_name]!=x[column] else np.nan,axis=1)
  15. df_update.drop(column_previous_name,axis=1,inplace=True)
  16. df_update=df_update[df_update['_Operation']==4]
  17. df_update

Online version of the code

The output is as required. Only one line per group ( Same Change Sequence ) with the the value for each non meta or PK column ( column starting with "_" or index and "Dossier_x") if it changed and NaN if it didn't change. I need to do so for every columns ( i don't know the name of the columns in advance )

Regards

Vincent

The program is working ( in the question) but the performance are bad.

英文:

I have done a python program to compare rows inside groups.But the performances are poor. The data are coming from a Change Data Capture system. For every change, there is a Sequence id , and an Operation number. For an Update operation, there is two rows: One with Operation=3 (previous value ) and one with Operation=4 (new value). The columns with no changes are set to null but a value can change from "Somevalue" to NULL so i need to compare row 3 and 4 to know when it's a Null because the value is really Null or because there is no change.

This is an example of the source data :

Source data

This is the output required :

Desired outcome

Bellow my code with the same mockup data :

  1. import pandas as pd
  2. import numpy as np
  3. d={'_Change-Sequence':[1,1,2,2,3,3],
  4. '_Operation':[3,4,3,4,3,4],
  5. 'Dossier_x':[1,1,2,2,3,3],
  6. 'IsCovidPositiv':['Yes','No','No',np.NaN,'Yes','Yes'],
  7. 'Status':[np.NaN,'KO',np.NaN,np.NaN,np.NaN,np.NaN]
  8. }
  9. df_update=pd.DataFrame(data=d)
  10. print(df_update)
  11. for column in [column for column in df_update.columns if column not in {'index','Dossier_x'} if not column.startswith('_')]:
  12. column_previous_name=column+"_Previous|"
  13. df_update[column_previous_name]=df_update.groupby('_Change-Sequence')[column].shift()
  14. df_update[column]=df_update.apply(lambda x:x[column] if x[column_previous_name]!=x[column] else np.nan,axis=1)
  15. df_update.drop(column_previous_name,axis=1,inplace=True)
  16. df_update=df_update[df_update['_Operation']==4]
  17. df_update

Online version of the code

The output is as required. Only one line per group ( Same Change Sequence ) with the the value for each non meta or PK column ( column starting with "_" or index and "Dossier_x") if it changed and NaN if it didn't change. I need to do so for every columns ( i don't know the name of the columns in advance )

Regards

Vincent

The program is working ( in the question) but the performance are bad.

答案1

得分: 0

如果我正确理解你的逻辑,你可以简化你的代码如下:

  1. cols = [column for column in df_update.columns if column not in {'index', 'Dossier_x'}
  2. if not column.startswith('_')]
  3. # 获取移位后的数值
  4. tmp = df_update.groupby('_Change-Sequence')[cols].shift()
  5. # 屏蔽相等的数值并切片
  6. out = df_update.mask(df_update.eq(tmp, axis=0)).loc[df_update['_Operation'].eq(4)]

输出:

  1. _Change-Sequence _Operation Dossier_x IsCovidPositiv Status
  2. 1 1 4 1 No KO
  3. 3 2 4 2 NaN NaN
  4. 5 3 4 3 NaN NaN
英文:

If I understood correctly your logic, you could simplify your code to:

  1. cols = [column for column in df_update.columns if column not in {'index','Dossier_x'}
  2. if not column.startswith('_')]
  3. # get shifted values
  4. tmp = df_update.groupby('_Change-Sequence')[cols].shift()
  5. # mask equal values and slice
  6. out = df_update.mask(df_update.eq(tmp, axis=0)).loc[df_update['_Operation'].eq(4)]

Output:

  1. _Change-Sequence _Operation Dossier_x IsCovidPositiv Status
  2. 1 1 4 1 No KO
  3. 3 2 4 2 NaN NaN
  4. 5 3 4 3 NaN NaN

huangapple
  • 本文由 发表于 2023年4月4日 16:13:12
  • 转载请务必保留本文链接:https://go.coder-hub.com/75926986.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定