How to compare two dataframes and return a new dataframe with only the records that have changed

Question


I want to build a Python script that compares two pandas dataframes and creates a new dataframe that I can use to update my SQL table. I create df1 by reading the existing table. I create df2 by reading the new data through an API call. I want to isolate the changed rows and update the SQL table with the new values.

I have attempted the comparison with an outer merge, but I need help returning a dataframe that contains only the records with a different value in any field.

Here is my example df1:

[image: example df1]

Here is my example df2:

[image: example df2]

My desired output:

[image: desired output]

This function returns the entire dataframe and isn't working as expected:

def compare_dataframes(df1, df2, pk_col):
    # Merge the two dataframes on the primary key column
    df_merged = pd.merge(df1, df2, on=pk_col, how='outer', suffixes=('_old', '_new'))

    # Identify the rows that are different between the two dataframes
    df_diff = df_merged[df_merged.isna().any(axis=1)]

    # Drop the columns from the old dataframe and rename the columns from the new dataframe
    df_diff = df_diff.drop(columns=[col for col in df_diff.columns if col.endswith('_old')])
    df_diff = df_diff.rename(columns={col: col.replace('_new', '') for col in df_diff.columns})

    return df_diff

Answer 1

Score: 1


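One approach is to concatenate the two dataframes and then drop every row that appears more than once:

frames = {1: df1, 2: df2}  # renamed from "dict" to avoid shadowing the built-in
df = pd.concat(frames)
df.drop_duplicates(keep=False)  # keep=False drops all copies of a duplicated row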
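One caveat with this approach: `drop_duplicates` compares column values only and ignores the index. If the key lives in the index, two unrelated rows that happen to hold the same values cancel each other out. The following is a minimal sketch of that effect and of one way around it, assuming a key named `id`; the toy frames `old` and `new` are hypothetical and not taken from the question.

```python
import pandas as pd

# Hypothetical toy data; "id" is assumed to be the primary key.
old = pd.DataFrame({"id": [0, 1, 2], "field1": ["x", "x", "y"]}).set_index("id")
new = pd.DataFrame({"id": [0, 1, 2], "field1": ["x", "y", "y"]}).set_index("id")

# With the key in the index, drop_duplicates sees only the values.
# The changed row (id 1, now "y") equals the row for id 2, so it is discarded too.
by_values = pd.concat([old, new]).drop_duplicates(keep=False)

# Moving the key into a regular column makes (id, values) the unit of comparison.
both = pd.concat([old.reset_index(), new.reset_index()], ignore_index=True)
unmatched = both.drop_duplicates(keep=False)

# Each changed id now appears twice (old version first, new version last).
changed = unmatched.drop_duplicates(subset="id", keep="last")
```

Here `by_values` ends up empty even though id 1 changed, while `changed` holds the new version of that row.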

As provided in an answer to a similar question:
https://stackoverflow.com/a/42649293/21442120

from io import StringIO
import pandas as pd

# Old data (df1) and new data (df2) as inline CSV
DF1 = StringIO("""
id,field1,field2,field3,field4
0,x,y,,b
1,x,,,
2,x,y,z,
3,x,y,z,b
4,x,y,,b""")

DF2 = StringIO("""
id,field1,field2,field3,field4
0,x,y,,b
1,x,,a,
2,x,y,z,
3,x,y,z,b
4,x,y,a,b
""")

df1 = pd.read_csv(DF1, index_col='id')
df2 = pd.read_csv(DF2, index_col='id')

# STEP 1: stack both frames (the dict keys become an outer index level)
# and drop every row whose values occur more than once
dictionary = {1: df1, 2: df2}
df = pd.concat(dictionary)
df3 = df.drop_duplicates(keep=False).reset_index()

# STEP 2: for each id keep only the last (new) version, then drop the helper column
df4 = df3.drop_duplicates(subset=['id'], keep='last')
df4 = df4.drop('level_0', axis=1)
df4.head()

This gives the desired output:

   id field1 field2 field3 field4
1   1      x    NaN      a    NaN
2   4      x      y      a      b
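Since the question started from a merge, an alternative is to merge on all columns with `indicator=True` and keep the rows found only in the new frame. The sketch below assumes both frames share the same columns with compatible dtypes and hold the key as a regular column; `changed_records` is a hypothetical helper name and the toy data is illustrative. It returns brand-new ids as well as modified ones. The pandas documentation states that null keys are matched against each other in a merge, but that is worth verifying on your own data.

```python
import pandas as pd

def changed_records(df_old: pd.DataFrame, df_new: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of df_new that have no identical row in df_old."""
    # With no `on` argument, merge joins on every column the frames share,
    # so a row matches only if all of its fields are equal.
    merged = df_new.merge(df_old, how="left", indicator=True)
    # "left_only" marks rows present in df_new but absent from df_old.
    only_new = merged[merged["_merge"] == "left_only"]
    return only_new.drop(columns="_merge").reset_index(drop=True)

# Hypothetical toy data with "id" as the key column.
df_old = pd.DataFrame(
    {"id": [0, 1, 2], "field1": ["x", "x", "y"], "field2": ["a", "b", "c"]}
)
df_new = pd.DataFrame(
    {"id": [0, 1, 2], "field1": ["x", "x", "y"], "field2": ["a", "z", "c"]}
)

result = changed_records(df_old, df_new)
```

Only the row for id 1 differs between the two toy frames, so `result` contains that single row with its new values, ready to be written back to the table.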

Posted by huangapple on 2023-04-01 01:00:05
Source: https://go.coder-hub.com/75900997.html