Python Pandas DataFrame Merge on Columns with Overwrite
Question
Is there a way to merge two Pandas DataFrames, by matching on (and retaining) supplied columns, but overwriting all the rest?
For example:
import pandas as pd
df1 = pd.DataFrame(columns=["Name", "Gender", "Age", "LastLogin", "LastPurchase"])
df1.loc[0] = ["Bob", "Male", "21", "2023-01-01", "2023-01-01"]
df1.loc[1] = ["Frank", "Male", "22", "2023-02-01", "2023-02-01"]
df1.loc[2] = ["Steve", "Male", "23", "2023-03-01", "2023-03-01"]
df1.loc[3] = ["John", "Male", "24", "2023-04-01", "2023-04-01"]
df2 = pd.DataFrame(columns=["Name", "Gender", "Age", "LastLogin", "LastPurchase"])
df2.loc[0] = ["Steve", "Male", "23", "2022-11-01", "2022-11-02"]
df2.loc[1] = ["Simon", "Male", "23", "2023-03-01", "2023-03-02"]
df2.loc[2] = ["Gary", "Male", "24", "2023-04-01", "2023-04-02"]
df2.loc[3] = ["Bob", "Male", "21", "2022-12-01", "2022-12-01"]
>>> df1
    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2023-01-01   2023-01-01
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2023-03-01   2023-03-01
3   John   Male  24  2023-04-01   2023-04-01
>>> df2
    Name Gender Age   LastLogin LastPurchase
0  Steve   Male  23  2022-11-01   2022-11-02
1  Simon   Male  23  2023-03-01   2023-03-02
2   Gary   Male  24  2023-04-01   2023-04-02
3    Bob   Male  21  2022-12-01   2022-12-01
What I'd like to end up with is df1 updated with values from df2 wherever the "Name", "Gender" and "Age" columns match, without caring what the other columns are. So I'd end up with this:
>>> df1
    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2022-12-01   2022-12-01  # Updated last two columns from df2
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2022-11-01   2022-11-02  # Updated last two columns from df2
3   John   Male  24  2023-04-01   2023-04-01
I can do a merge like this:
>>> df3 = df1.merge(df2, on=["Name", "Gender", "Age"], how='left')
But then I have to manually pull the data out of the suffixed columns the merge creates, and then drop those columns by name:
>>> df3['LastLogin'] = df3['LastLogin_y'].fillna(df3['LastLogin_x'])
>>> df3['LastPurchase'] = df3['LastPurchase_y'].fillna(df3['LastPurchase_x'])
>>> df3.drop(['LastLogin_x', 'LastLogin_y'], axis=1, inplace=True)
>>> df3.drop(['LastPurchase_x', 'LastPurchase_y'], axis=1, inplace=True)
>>>
>>> df3
    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2022-12-01   2022-12-01
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2022-11-01   2022-11-02
3   John   Male  24  2023-04-01   2023-04-01
I'm trying to avoid this, as I need a generic way to update batches of data, and I don't know all their column names (just the ones I want to match on).
Answer 1
Score: 2
You can avoid the _x/_y columns by selecting only the merge keys from df1 before merging, then filling the gaps from the original with fillna (or combine_first):
cols = ["Name", "Gender", "Age"]
df3 = df1[cols].merge(df2, how='left').fillna(df1)
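For anyone who wants to run this end to end, here is a minimal sketch of the same idea wrapped in a function. The name `overwrite_from` is hypothetical (it is not a pandas API), and the sketch rests on two assumptions that the one-liner leaves implicit: the key combinations in df2 are unique (enforced here with `validate="many_to_one"`), and df1 can be given a fresh 0..n-1 index so that `fillna` lines up row by row.

```python
import pandas as pd


def overwrite_from(df1, df2, cols):
    """Hypothetical helper: return df1 with non-key columns overwritten
    from df2 wherever the key columns in `cols` match."""
    # Assumption: a default RangeIndex, so fillna() aligns row by row
    # with the freshly indexed merge result.
    left = df1.reset_index(drop=True)
    # Assumption: each key combination appears at most once in df2;
    # validate= raises a MergeError otherwise instead of duplicating rows.
    merged = left[cols].merge(df2, on=cols, how="left", validate="many_to_one")
    # Rows with no match in df2 are NaN here; take those values from df1.
    return merged.fillna(left)


cols = ["Name", "Gender", "Age"]
columns = cols + ["LastLogin", "LastPurchase"]
df1 = pd.DataFrame(
    [
        ["Bob", "Male", "21", "2023-01-01", "2023-01-01"],
        ["Frank", "Male", "22", "2023-02-01", "2023-02-01"],
        ["Steve", "Male", "23", "2023-03-01", "2023-03-01"],
        ["John", "Male", "24", "2023-04-01", "2023-04-01"],
    ],
    columns=columns,
)
df2 = pd.DataFrame(
    [
        ["Steve", "Male", "23", "2022-11-01", "2022-11-02"],
        ["Simon", "Male", "23", "2023-03-01", "2023-03-02"],
        ["Gary", "Male", "24", "2023-04-01", "2023-04-02"],
        ["Bob", "Male", "21", "2022-12-01", "2022-12-01"],
    ],
    columns=columns,
)

df3 = overwrite_from(df1, df2, cols)
print(df3)
```

Two things to keep in mind: a value that is genuinely missing (NaN) in a matched df2 row is also filled from df1, and any column that exists only in df1 does not appear in the result, because the output columns come from the merge.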
A more convoluted approach using indexes:
df3 = (df2.set_index(cols)
.combine_first(df1.set_index(cols))
.reindex(df1[cols]).reset_index()
)
Output:
    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2022-12-01   2022-12-01
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2022-11-01   2022-11-02
3   John   Male  24  2023-04-01   2023-04-01
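If you go the index route, a variant that builds the reindex target explicitly with `pd.MultiIndex.from_frame` may be easier to read. This is only a sketch on the question's data, and it assumes the key combinations are unique in both frames (reindexing fails on duplicate labels):

```python
import pandas as pd

cols = ["Name", "Gender", "Age"]
columns = cols + ["LastLogin", "LastPurchase"]
df1 = pd.DataFrame(
    [
        ["Bob", "Male", "21", "2023-01-01", "2023-01-01"],
        ["Frank", "Male", "22", "2023-02-01", "2023-02-01"],
        ["Steve", "Male", "23", "2023-03-01", "2023-03-01"],
        ["John", "Male", "24", "2023-04-01", "2023-04-01"],
    ],
    columns=columns,
)
df2 = pd.DataFrame(
    [
        ["Steve", "Male", "23", "2022-11-01", "2022-11-02"],
        ["Simon", "Male", "23", "2023-03-01", "2023-03-02"],
        ["Gary", "Male", "24", "2023-04-01", "2023-04-02"],
        ["Bob", "Male", "21", "2022-12-01", "2022-12-01"],
    ],
    columns=columns,
)

# The keys of df1, in df1's row order, as an explicit MultiIndex.
target = pd.MultiIndex.from_frame(df1[cols])

df3 = (
    df2.set_index(cols)
    .combine_first(df1.set_index(cols))  # df2 wins where it has a value
    .reindex(target)                     # keep only df1's keys, in df1's order
    .reset_index()                       # turn the keys back into columns
)
print(df3)
```

Note that `combine_first` produces the union of both indexes, so the `reindex` step is what drops the rows that exist only in df2 (Simon and Gary here).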