April 1, 2023 01:00:05

How to compare two dataframes and return a new dataframe with only the records that have changed

Question

I want to build a Python script that compares two pandas dataframes and creates a new df that I can use to update my SQL table. I create df1 by reading the existing table. I create df2 by reading the new data through an API call. I want to isolate the changed rows and update the SQL table with the new values.

I have attempted to compare through an outer merge, but I need help returning the dataframe with only records with a different value in any field.

Here is my example df1:

Here is my example df2:

My desired output:

This function returns the entire dataframe and isn't working as expected:

import pandas as pd

def compare_dataframes(df1, df2, pk_col):
    # Merge the two dataframes on the primary key column
    df_merged = pd.merge(df1, df2, on=pk_col, how='outer', suffixes=('_old', '_new'))

    # Identify the rows that are different between the two dataframes
    df_diff = df_merged[df_merged.isna().any(axis=1)]

    # Drop the columns from the old dataframe and rename the columns from the new dataframe
    df_diff = df_diff.drop(columns=[col for col in df_diff.columns if col.endswith('_old')])
    df_diff = df_diff.rename(columns={col: col.replace('_new', '') for col in df_diff.columns})

    return df_diff

Answer 1

Score: 1

One approach could be to concatenate the two dataframes and then remove the duplicates, as shown below:

# The keys 1 and 2 record which dataframe each row came from
frames = {1: df1, 2: df2}
df = pd.concat(frames)
df.drop_duplicates(keep=False)
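To illustrate what `keep=False` does in this pattern, here is a minimal sketch on two made-up frames. The names `old`, `new` and their values are assumptions for illustration only, not the asker's data. Note that `drop_duplicates` compares column values and ignores the index, so in this sketch `id` is kept as a regular column and takes part in the comparison.

```python
import pandas as pd

# Hypothetical sample data: only the row with id 1 differs between the frames.
old = pd.DataFrame({"id": [0, 1, 2], "value": ["a", "b", "c"]})
new = pd.DataFrame({"id": [0, 1, 2], "value": ["a", "B", "c"]})

# The dict keys (1 and 2) become the outer index level and mark the source frame.
combined = pd.concat({1: old, 2: new})

# keep=False drops every copy of a row that occurs more than once, so only
# rows that exist in just one of the frames survive: the old and the new
# version of id 1.
diff = combined.drop_duplicates(keep=False)
print(diff)
```

Both versions of the changed row remain at this point, which is why a second step is needed to keep only the new one.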

As provided in an answer to a similar question:
https://stackoverflow.com/a/42649293/21442120

from io import StringIO
import pandas as pd

DF1 = StringIO("""
id,field1,field2,field3,field4
0,x,y,,b
1,x,,,
2,x,y,z,
3,x,y,z,b
4,x,y,,b""")

DF2 = StringIO("""
id,field1,field2,field3,field4
0,x,y,,b
1,x,,a,
2,x,y,z,
3,x,y,z,b
4,x,y,a,b
""")

df1 = pd.read_csv(DF1, index_col='id')
df2 = pd.read_csv(DF2, index_col='id')

# STEP 1: stack both frames (keys 1 and 2 mark the source frame) and
# drop every row whose field values occur more than once
frames = {1: df1, 2: df2}
df = pd.concat(frames)
df3 = df.drop_duplicates(keep=False).reset_index()

# STEP 2: for each id keep only the last remaining row, i.e. the version from df2
df4 = df3.drop_duplicates(subset=['id'], keep='last')
df4 = df4.drop('level_0', axis=1)
df4.head()

This gives the desired output:

   id field1 field2 field3 field4
1   1      x    NaN      a    NaN
2   4      x      y      a      b
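As an alternative (not the method from the linked answer), a left merge with `indicator=True` can pick out the rows of the new data that have no identical counterpart in the old data. The sketch below is only an illustration under some assumptions: the helper name `changed_rows` is hypothetical, `id` is kept as a regular column so it is part of the comparison, both frames share the same columns and dtypes, and `id` is unique in each frame. Rows whose `id` does not exist in the old frame at all would also be returned, which may or may not be what you want for the SQL update.

```python
from io import StringIO

import pandas as pd

# Same sample data as in the answer above.
OLD_CSV = """id,field1,field2,field3,field4
0,x,y,,b
1,x,,,
2,x,y,z,
3,x,y,z,b
4,x,y,,b
"""

NEW_CSV = """id,field1,field2,field3,field4
0,x,y,,b
1,x,,a,
2,x,y,z,
3,x,y,z,b
4,x,y,a,b
"""


def changed_rows(old: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of `new` that have no identical row in `old`.

    Hypothetical helper for illustration only.
    """
    columns = list(new.columns)
    # Merging on every column means a row only matches when the key and all
    # fields are equal; pandas matches NaN with NaN in merge keys.
    merged = new.merge(old, how="left", on=columns, indicator=True)
    # "left_only" marks rows that are present in `new` but not in `old`.
    return merged.loc[merged["_merge"] == "left_only", columns]


old = pd.read_csv(StringIO(OLD_CSV))
new = pd.read_csv(StringIO(NEW_CSV))

result = changed_rows(old, new)
print(result)
```

Because the key column is part of the comparison here, a row is never dropped just because a different `id` happens to carry the same field values.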
