Python Pandas DataFrame Merge on Columns with Overwrite

Question

Is there a way to merge two Pandas DataFrames, by matching on (and retaining) supplied columns, but overwriting all the rest?

For example:

import pandas as pd

df1 = pd.DataFrame(columns=["Name", "Gender", "Age", "LastLogin", "LastPurchase"])
df1.loc[0] = ["Bob", "Male", "21", "2023-01-01", "2023-01-01"]
df1.loc[1] = ["Frank", "Male", "22", "2023-02-01", "2023-02-01"]
df1.loc[2] = ["Steve", "Male", "23", "2023-03-01", "2023-03-01"]
df1.loc[3] = ["John", "Male", "24", "2023-04-01", "2023-04-01"]

df2 = pd.DataFrame(columns=["Name", "Gender", "Age", "LastLogin", "LastPurchase"])
df2.loc[0] = ["Steve", "Male", "23", "2022-11-01", "2022-11-02"]
df2.loc[1] = ["Simon", "Male", "23", "2023-03-01", "2023-03-02"]
df2.loc[2] = ["Gary", "Male", "24", "2023-04-01", "2023-04-02"]
df2.loc[3] = ["Bob", "Male", "21", "2022-12-01", "2022-12-01"]

>>> df1
    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2023-01-01   2023-01-01
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2023-03-01   2023-03-01
3   John   Male  24  2023-04-01   2023-04-01

>>> df2
    Name Gender Age   LastLogin LastPurchase
0  Steve   Male  23  2022-11-01   2022-11-02
1  Simon   Male  23  2023-03-01   2023-03-02
2   Gary   Male  24  2023-04-01   2023-04-02
3    Bob   Male  21  2022-12-01   2022-12-01

What I'd like is to end up with df1 updated with values from df2 wherever the "Name", "Gender" and "Age" columns match, regardless of what the other columns are. So I'd end up with this:

>>> df1
    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2022-12-01   2022-12-01 # Updated last two columns from df2
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2022-11-01   2022-11-02 # Updated last two columns from df2
3   John   Male  24  2023-04-01   2023-04-01

I can do a merge like this:

>>> df3 = df1.merge(df2, on=["Name", "Gender", "Age"], how='left')

But then I have to manually extract data from and drop the new columns created from the merge, using their names:

>>> df3['LastLogin'] = df3['LastLogin_y'].fillna(df3['LastLogin_x'])
>>> df3['LastPurchase'] = df3['LastPurchase_y'].fillna(df3['LastPurchase_x'])
>>> df3.drop(['LastLogin_x', 'LastLogin_y'], axis=1, inplace=True)
>>> df3.drop(['LastPurchase_x', 'LastPurchase_y'], axis=1, inplace=True)
>>> 
>>> df3
    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2022-12-01   2022-12-01
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2022-11-01   2022-11-02
3   John   Male  24  2023-04-01   2023-04-01

I'm trying to avoid this, as I need a generic way to update batches of data, and I don't know all their column names (just the ones I want to match on).

Answer 1

Score: 2

You can avoid the _x/_y columns by slicing only the merging keys from df1 for the merge, then using fillna/combine_first with the original:

cols = ["Name", "Gender", "Age"]
df3 = df1[cols].merge(df2, how='left').fillna(df1)
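
For batch use, the one-liner above can be wrapped in a small helper. This is only a sketch: the function name `overwrite_matching` is made up for illustration, and it assumes both frames have the same columns and that each key combination appears at most once in the second frame. The `reset_index` call is there because `fillna` aligns on the index, so the original frame needs the same 0..n-1 index that the merge result gets.

```python
import pandas as pd


def overwrite_matching(left, right, keys):
    """Return `left` with its non-key columns overwritten by `right` where `keys` match.

    Hypothetical helper, not a pandas API.
    """
    # Reset to a 0..n-1 index so that fillna() below lines up row-for-row
    # with the merge result, which always carries a fresh default index.
    base = left.reset_index(drop=True)
    # validate="many_to_one" raises MergeError if `right` repeats a key
    # combination, which would duplicate rows and break the alignment.
    merged = base[keys].merge(right, on=keys, how="left", validate="many_to_one")
    # Rows without a match are NaN after the left merge; fill them from `left`.
    return merged.fillna(base)


columns = ["Name", "Gender", "Age", "LastLogin", "LastPurchase"]
df1 = pd.DataFrame(
    [
        ["Bob", "Male", "21", "2023-01-01", "2023-01-01"],
        ["Frank", "Male", "22", "2023-02-01", "2023-02-01"],
        ["Steve", "Male", "23", "2023-03-01", "2023-03-01"],
        ["John", "Male", "24", "2023-04-01", "2023-04-01"],
    ],
    columns=columns,
)
df2 = pd.DataFrame(
    [
        ["Steve", "Male", "23", "2022-11-01", "2022-11-02"],
        ["Simon", "Male", "23", "2023-03-01", "2023-03-02"],
        ["Gary", "Male", "24", "2023-04-01", "2023-04-02"],
        ["Bob", "Male", "21", "2022-12-01", "2022-12-01"],
    ],
    columns=columns,
)

df3 = overwrite_matching(df1, df2, ["Name", "Gender", "Age"])
print(df3)
```

Note that a missing value in df2 does not overwrite anything here, since `fillna` falls back to the df1 value in that case.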

A more convoluted approach using indexes:

df3 = (df2.set_index(cols)
          .combine_first(df1.set_index(cols))
          .reindex(df1[cols]).reset_index()
       )

Output:

    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2022-12-01   2022-12-01
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2022-11-01   2022-11-02
3   John   Male  24  2023-04-01   2023-04-01
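
A further option, which is not one of the two approaches above, is `DataFrame.update`. It aligns on the index, so both frames are indexed by the match keys first. This is a minimal sketch that assumes the key combinations in df2 are unique; rows that exist only in df2 (Simon, Gary) are ignored, and NaN values in df2 never overwrite existing values in df1.

```python
import pandas as pd

cols = ["Name", "Gender", "Age"]
columns = cols + ["LastLogin", "LastPurchase"]
df1 = pd.DataFrame(
    [
        ["Bob", "Male", "21", "2023-01-01", "2023-01-01"],
        ["Frank", "Male", "22", "2023-02-01", "2023-02-01"],
        ["Steve", "Male", "23", "2023-03-01", "2023-03-01"],
        ["John", "Male", "24", "2023-04-01", "2023-04-01"],
    ],
    columns=columns,
)
df2 = pd.DataFrame(
    [
        ["Steve", "Male", "23", "2022-11-01", "2022-11-02"],
        ["Simon", "Male", "23", "2023-03-01", "2023-03-02"],
        ["Gary", "Male", "24", "2023-04-01", "2023-04-02"],
        ["Bob", "Male", "21", "2022-12-01", "2022-12-01"],
    ],
    columns=columns,
)

# Index both frames by the match keys so update() can align rows on them.
updated = df1.set_index(cols)
# update() works in place and returns None; non-NA values from df2 replace
# the values in `updated` wherever the index labels match.
updated.update(df2.set_index(cols))
# Turn the keys back into ordinary columns; df1's row order is preserved.
df3 = updated.reset_index()
print(df3)
```

Because `set_index` returns a new frame, df1 itself is left unchanged and the result lives in df3.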

huangapple
  • Posted on February 23, 2023, 21:11:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75545314.html