How to compare two DataFrames and return a new DataFrame with only the records that have changed
Question
I want to build a Python script that compares two pandas DataFrames and creates a new df that I can use to update my SQL table. I create df1 by reading the existing table and df2 by reading the new data through an API call. I want to isolate the changed rows and update the SQL table with the new values.
I have attempted the comparison with an outer merge, but I need help returning a DataFrame that contains only the records with a different value in any field.
Here is my example df1:
Here is my example df2:
My desired output:
This function returns the entire dataframe and isn't working as expected:
import pandas as pd

def compare_dataframes(df1, df2, pk_col):
    # Merge the two dataframes on the primary key column
    df_merged = pd.merge(df1, df2, on=pk_col, how='outer', suffixes=('_old', '_new'))
    # Identify the rows that are different between the two dataframes
    df_diff = df_merged[df_merged.isna().any(axis=1)]
    # Drop the columns from the old dataframe and rename the columns from the new dataframe
    df_diff = df_diff.drop(columns=[col for col in df_diff.columns if col.endswith('_old')])
    df_diff = df_diff.rename(columns={col: col.replace('_new', '') for col in df_diff.columns})
    return df_diff
Answer 1
Score: 1
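One approach is to concatenate the two DataFrames and then remove every row that appears more than once, as shown below:
frames = {1: df1, 2: df2}  # keys 1 and 2 label the source frame of each row
df = pd.concat(frames)
df = df.drop_duplicates(keep=False)  # keep=False drops all copies of a duplicated row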
As provided in an answer to a similar question:
https://stackoverflow.com/a/42649293/21442120
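from io import StringIO

import pandas as pd

# Sample data: df1 stands for the existing table, df2 for the new data.
DF1 = StringIO("""
id,field1,field2,field3,field4
0,x,y,,b
1,x,,,
2,x,y,z,
3,x,y,z,b
4,x,y,,b""")
DF2 = StringIO("""
id,field1,field2,field3,field4
0,x,y,,b
1,x,,a,
2,x,y,z,
3,x,y,z,b
4,x,y,a,b
""")
df1 = pd.read_table(DF1, sep=',', index_col='id')
df2 = pd.read_table(DF2, sep=',', index_col='id')

# STEP 1: stack both frames under the keys 1 (old) and 2 (new), then drop
# every row whose field values occur more than once. Note that
# drop_duplicates ignores the index, so 'id' is not part of this comparison.
dictionary = {1: df1, 2: df2}
df = pd.concat(dictionary)
df3 = df.drop_duplicates(keep=False).reset_index()

# STEP 2: where an id survives from both frames, keep the row from df2.
df4 = df3.drop_duplicates(subset=['id'], keep='last')
df4 = df4.drop('level_0', axis=1)
df4.head()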
This gives the desired output:
   id field1 field2 field3 field4
1   1      x    NaN      a    NaN
2   4      x      y      a      b
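Because drop_duplicates compares only column values and ignores the index, the approach above never compares the id itself. As a hedged alternative, here is a minimal sketch that compares rows key by key, using the same (df1, df2, pk_col) arguments as the function in the question. The name changed_rows and its internals are illustrative and not part of the original answer. It assumes the primary key is unique in each frame and that both frames share the same columns.

```python
from io import StringIO

import pandas as pd


def changed_rows(old: pd.DataFrame, new: pd.DataFrame, pk_col: str) -> pd.DataFrame:
    """Return the rows of `new` that are absent from `old` or differ in any field.

    Hypothetical helper; assumes `pk_col` is unique in both frames.
    """
    old_idx = old.set_index(pk_col)
    new_idx = new.set_index(pk_col)

    # Align the old data to the keys and columns of the new data.
    # Keys missing from `old` become all-NaN rows.
    old_aligned = old_idx.reindex(index=new_idx.index, columns=new_idx.columns)

    # Two cells match when they are equal or both missing (NaN == NaN is False).
    same = (new_idx == old_aligned) | (new_idx.isna() & old_aligned.isna())

    # Keys that exist only in the new data always count as changed.
    is_new_key = ~new_idx.index.isin(old_idx.index)

    changed = new_idx.loc[~same.all(axis=1) | is_new_key]
    return changed.reset_index()


# Same sample data as in the answer above, read with `id` as a regular column.
OLD_CSV = """id,field1,field2,field3,field4
0,x,y,,b
1,x,,,
2,x,y,z,
3,x,y,z,b
4,x,y,,b"""
NEW_CSV = """id,field1,field2,field3,field4
0,x,y,,b
1,x,,a,
2,x,y,z,
3,x,y,z,b
4,x,y,a,b"""

df1 = pd.read_csv(StringIO(OLD_CSV))
df2 = pd.read_csv(StringIO(NEW_CSV))

result = changed_rows(df1, df2, "id")
print(result)
```

On the sample data this selects the rows with id 1 and 4, the same records as the output above. Rows that exist only in df1 (deleted records) are not reported, since the goal stated in the question is to update the table with new values.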