Python Pandas DataFrame Merge on Columns with Overwrite
Question
Is there a way to merge two Pandas DataFrames, by matching on (and retaining) supplied columns, but overwriting all the rest?
For example:
import pandas as pd
df1 = pd.DataFrame(columns=["Name", "Gender", "Age", "LastLogin", "LastPurchase"])
df1.loc[0] = ["Bob", "Male", "21", "2023-01-01", "2023-01-01"]
df1.loc[1] = ["Frank", "Male", "22", "2023-02-01", "2023-02-01"]
df1.loc[2] = ["Steve", "Male", "23", "2023-03-01", "2023-03-01"]
df1.loc[3] = ["John", "Male", "24", "2023-04-01", "2023-04-01"]
df2 = pd.DataFrame(columns=["Name", "Gender", "Age", "LastLogin", "LastPurchase"])
df2.loc[0] = ["Steve", "Male", "23", "2022-11-01", "2022-11-02"]
df2.loc[1] = ["Simon", "Male", "23", "2023-03-01", "2023-03-02"]
df2.loc[2] = ["Gary", "Male", "24", "2023-04-01", "2023-04-02"]
df2.loc[3] = ["Bob", "Male", "21", "2022-12-01", "2022-12-01"]
>>> df1
    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2023-01-01   2023-01-01
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2023-03-01   2023-03-01
3   John   Male  24  2023-04-01   2023-04-01
>>> df2
    Name Gender Age   LastLogin LastPurchase
0  Steve   Male  23  2022-11-01   2022-11-02
1  Simon   Male  23  2023-03-01   2023-03-02
2   Gary   Male  24  2023-04-01   2023-04-02
3    Bob   Male  21  2022-12-01   2022-12-01
What I'd like is to end up with df1 updated with values from df2 wherever the "Name", "Gender" and "Age" columns match, without caring what the other columns are, so I'd end up with this:
>>> df1
    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2022-12-01   2022-12-01  # Updated last two columns from df2
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2022-11-01   2022-11-02  # Updated last two columns from df2
3   John   Male  24  2023-04-01   2023-04-01
I can do a merge like this:
>>> df3 = df1.merge(df2, on=["Name", "Gender", "Age"], how='left')
But then I have to manually extract data from and drop the new columns created from the merge, using their names:
>>> df3['LastLogin'] = df3['LastLogin_y'].fillna(df3['LastLogin_x'])
>>> df3['LastPurchase'] = df3['LastPurchase_y'].fillna(df3['LastPurchase_x'])
>>> df3.drop(['LastLogin_x', 'LastLogin_y'], axis=1, inplace=True)
>>> df3.drop(['LastPurchase_x', 'LastPurchase_y'], axis=1, inplace=True)
>>>
>>> df3
    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2022-12-01   2022-12-01
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2022-11-01   2022-11-02
3   John   Male  24  2023-04-01   2023-04-01
I'm trying to avoid this, as I need a generic way to update batches of data, and I don't know all their column names (just the ones I want to match on).
Answer 1
Score: 2
You can avoid the _x/_y columns by slicing only the merging keys out of df1 for the merge, then using fillna/combine_first with the original:
cols = ["Name", "Gender", "Age"]
df3 = df1[cols].merge(df2, how='left').fillna(df1)
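For batches where only the key columns are known, the one-liner above could be wrapped in a small function. This is a minimal sketch, not part of the original answer: `overwrite_from` is a hypothetical name, and it assumes every non-key column to keep also exists in the second frame. It also assumes the second frame has at most one row per key, which `validate="many_to_one"` checks.

```python
import pandas as pd


def overwrite_from(left, right, keys):
    """Overwrite non-key columns of `left` with `right` where `keys` match.

    `overwrite_from` is a hypothetical helper name, not a pandas function.
    """
    # Slice `left` down to the keys so the merge produces no _x/_y columns.
    # validate="many_to_one" raises MergeError if `right` repeats a key,
    # which would otherwise duplicate rows of `left`.
    merged = left[keys].merge(right, on=keys, how="left", validate="many_to_one")
    # The merge result gets a fresh 0..n-1 index; restore left's index so
    # that fillna aligns row by row even when `left` has a custom index.
    merged.index = left.index
    # Rows without a match are NaN after the merge; take them from `left`.
    return merged.fillna(left)


cols = ["Name", "Gender", "Age"]
header = cols + ["LastLogin", "LastPurchase"]
df1 = pd.DataFrame(
    [
        ["Bob", "Male", "21", "2023-01-01", "2023-01-01"],
        ["Frank", "Male", "22", "2023-02-01", "2023-02-01"],
        ["Steve", "Male", "23", "2023-03-01", "2023-03-01"],
        ["John", "Male", "24", "2023-04-01", "2023-04-01"],
    ],
    columns=header,
)
df2 = pd.DataFrame(
    [
        ["Steve", "Male", "23", "2022-11-01", "2022-11-02"],
        ["Simon", "Male", "23", "2023-03-01", "2023-03-02"],
        ["Gary", "Male", "24", "2023-04-01", "2023-04-02"],
        ["Bob", "Male", "21", "2022-12-01", "2022-12-01"],
    ],
    columns=header,
)

df3 = overwrite_from(df1, df2, cols)
print(df3)
```

Note that a genuine NaN in df2 is indistinguishable from "no match" here, so it is also filled from df1.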
A more convoluted approach using indexes:
df3 = (df2.set_index(cols)
          .combine_first(df1.set_index(cols))
          .reindex(df1[cols]).reset_index()
       )
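To see why the reindex step is needed, the chain can be unrolled into separate steps. The sketch below rebuilds the question's data and, as an assumption of this example rather than the answer's exact code, builds the target index explicitly with `pd.MultiIndex.from_frame`.

```python
import pandas as pd

cols = ["Name", "Gender", "Age"]
header = cols + ["LastLogin", "LastPurchase"]
df1 = pd.DataFrame(
    [
        ["Bob", "Male", "21", "2023-01-01", "2023-01-01"],
        ["Frank", "Male", "22", "2023-02-01", "2023-02-01"],
        ["Steve", "Male", "23", "2023-03-01", "2023-03-01"],
        ["John", "Male", "24", "2023-04-01", "2023-04-01"],
    ],
    columns=header,
)
df2 = pd.DataFrame(
    [
        ["Steve", "Male", "23", "2022-11-01", "2022-11-02"],
        ["Simon", "Male", "23", "2023-03-01", "2023-03-02"],
        ["Gary", "Male", "24", "2023-04-01", "2023-04-02"],
        ["Bob", "Male", "21", "2022-12-01", "2022-12-01"],
    ],
    columns=header,
)

# Move the key columns into a MultiIndex on both frames.
left_i = df1.set_index(cols)
right_i = df2.set_index(cols)

# combine_first keeps the caller's (df2's) non-null values and falls back
# to df1's. The result is indexed by the union of both key sets, so it
# also contains the rows that exist only in df2 (Simon and Gary).
combined = right_i.combine_first(left_i)

# Keep only df1's keys, in df1's original row order.
target = pd.MultiIndex.from_frame(df1[cols])
df3 = combined.reindex(target).reset_index()
print(df3)
```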
Output:
    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2022-12-01   2022-12-01
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2022-11-01   2022-11-02
3   John   Male  24  2023-04-01   2023-04-01
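A third option, not mentioned in the answer above, is pandas' own `DataFrame.update`, which overwrites values in place where the indexes align. The sketch below assumes the keys are unique in both frames; the data is the question's.

```python
import pandas as pd

cols = ["Name", "Gender", "Age"]
header = cols + ["LastLogin", "LastPurchase"]
df1 = pd.DataFrame(
    [
        ["Bob", "Male", "21", "2023-01-01", "2023-01-01"],
        ["Frank", "Male", "22", "2023-02-01", "2023-02-01"],
        ["Steve", "Male", "23", "2023-03-01", "2023-03-01"],
        ["John", "Male", "24", "2023-04-01", "2023-04-01"],
    ],
    columns=header,
)
df2 = pd.DataFrame(
    [
        ["Steve", "Male", "23", "2022-11-01", "2022-11-02"],
        ["Simon", "Male", "23", "2023-03-01", "2023-03-02"],
        ["Gary", "Male", "24", "2023-04-01", "2023-04-02"],
        ["Bob", "Male", "21", "2022-12-01", "2022-12-01"],
    ],
    columns=header,
)

# Index both frames by the key columns so rows align on the keys.
left_i = df1.set_index(cols)

# update() works in place and returns None. It overwrites values in
# left_i with the non-NA values from the other frame where the index
# matches; keys found only in df2 (Simon, Gary) are ignored.
left_i.update(df2.set_index(cols))

# Turn the keys back into ordinary columns; df1's row order is kept.
df3 = left_i.reset_index()
print(df3)
```

Unlike the merge-based version, this keeps any columns that exist only in df1, since `update` only touches columns the two frames share.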