Python Pandas DataFrame Merge on Columns with Overwrite


Question

Is there a way to merge two Pandas DataFrames, by matching on (and retaining) supplied columns, but overwriting all the rest?

For example:

    import pandas as pd

    df1 = pd.DataFrame(columns=["Name", "Gender", "Age", "LastLogin", "LastPurchase"])
    df1.loc[0] = ["Bob", "Male", "21", "2023-01-01", "2023-01-01"]
    df1.loc[1] = ["Frank", "Male", "22", "2023-02-01", "2023-02-01"]
    df1.loc[2] = ["Steve", "Male", "23", "2023-03-01", "2023-03-01"]
    df1.loc[3] = ["John", "Male", "24", "2023-04-01", "2023-04-01"]

    df2 = pd.DataFrame(columns=["Name", "Gender", "Age", "LastLogin", "LastPurchase"])
    df2.loc[0] = ["Steve", "Male", "23", "2022-11-01", "2022-11-02"]
    df2.loc[1] = ["Simon", "Male", "23", "2023-03-01", "2023-03-02"]
    df2.loc[2] = ["Gary", "Male", "24", "2023-04-01", "2023-04-02"]
    df2.loc[3] = ["Bob", "Male", "21", "2022-12-01", "2022-12-01"]

    >>> df1
        Name Gender Age   LastLogin LastPurchase
    0    Bob   Male  21  2023-01-01   2023-01-01
    1  Frank   Male  22  2023-02-01   2023-02-01
    2  Steve   Male  23  2023-03-01   2023-03-01
    3   John   Male  24  2023-04-01   2023-04-01

    >>> df2
        Name Gender Age   LastLogin LastPurchase
    0  Steve   Male  23  2022-11-01   2022-11-02
    1  Simon   Male  23  2023-03-01   2023-03-02
    2   Gary   Male  24  2023-04-01   2023-04-02
    3    Bob   Male  21  2022-12-01   2022-12-01

What I'd like is to end up with df1 updated with the values from df2 wherever the "Name", "Gender" and "Age" columns match, regardless of what the other columns are. So I'd end up with this:

    >>> df1
        Name Gender Age   LastLogin LastPurchase
    0    Bob   Male  21  2022-12-01   2022-12-01  # Updated last two columns from df2
    1  Frank   Male  22  2023-02-01   2023-02-01
    2  Steve   Male  23  2022-11-01   2022-11-02  # Updated last two columns from df2
    3   John   Male  24  2023-04-01   2023-04-01

I can do a merge like this:

    >>> df3 = df1.merge(df2, on=["Name", "Gender", "Age"], how='left')

But then I have to manually pull the data out of the new suffixed columns that the merge creates, and then drop those columns, all by name:

    >>> df3['LastLogin'] = df3['LastLogin_y'].fillna(df3['LastLogin_x'])
    >>> df3['LastPurchase'] = df3['LastPurchase_y'].fillna(df3['LastPurchase_x'])
    >>> df3.drop(['LastLogin_x', 'LastLogin_y'], axis=1, inplace=True)
    >>> df3.drop(['LastPurchase_x', 'LastPurchase_y'], axis=1, inplace=True)
    >>>
    >>> df3
        Name Gender Age   LastLogin LastPurchase
    0    Bob   Male  21  2022-12-01   2022-12-01
    1  Frank   Male  22  2023-02-01   2023-02-01
    2  Steve   Male  23  2022-11-01   2022-11-02
    3   John   Male  24  2023-04-01   2023-04-01

I'm trying to avoid this, as I need a generic way to update batches of data, and I don't know all their column names (just the ones I want to match on).

Answer 1

Score: 2

You can avoid the _x/_y columns by merging only the key columns of df1 with df2, then using fillna/combine_first to restore the original values for rows that found no match:

    cols = ["Name", "Gender", "Age"]
    df3 = df1[cols].merge(df2, how='left').fillna(df1)
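
As a minimal sketch, the one-liner above can be wrapped in a reusable helper for batches where only the key columns are known. The function name `overwrite_matching` and its handling of duplicate keys are assumptions made for illustration; they are not part of the original answer. The sketch also assumes both frames share the same columns. Two things to keep in mind: `fillna(df1)` aligns on the row index, so the helper resets the index first, and a genuinely missing value in df2 on a matched row will be filled back in from df1.

```python
import pandas as pd


def overwrite_matching(left, right, keys):
    """Return a new frame based on `left`, with non-key columns overwritten
    from `right` wherever all `keys` match (hypothetical helper)."""
    # Assumption: if a key combination repeats in `right`, keep its last row,
    # so the left merge cannot multiply rows of `left`.
    right_unique = right.drop_duplicates(subset=keys, keep="last")
    # A left merge returns a fresh 0..n-1 index in `left` order, so reset the
    # index of `left` as well; fillna() aligns on index and column labels.
    base = left.reset_index(drop=True)
    merged = base[keys].merge(right_unique, on=keys, how="left")
    # Rows without a match are NaN outside the keys; fill them from `left`.
    merged = merged.fillna(base)
    # Restore the caller's original row labels.
    merged.index = left.index
    return merged


columns = ["Name", "Gender", "Age", "LastLogin", "LastPurchase"]
df1 = pd.DataFrame(
    [
        ["Bob", "Male", "21", "2023-01-01", "2023-01-01"],
        ["Frank", "Male", "22", "2023-02-01", "2023-02-01"],
        ["Steve", "Male", "23", "2023-03-01", "2023-03-01"],
        ["John", "Male", "24", "2023-04-01", "2023-04-01"],
    ],
    columns=columns,
)
df2 = pd.DataFrame(
    [
        ["Steve", "Male", "23", "2022-11-01", "2022-11-02"],
        ["Simon", "Male", "23", "2023-03-01", "2023-03-02"],
        ["Gary", "Male", "24", "2023-04-01", "2023-04-02"],
        ["Bob", "Male", "21", "2022-12-01", "2022-12-01"],
    ],
    columns=columns,
)

result = overwrite_matching(df1, df2, ["Name", "Gender", "Age"])
print(result)
```

Only the key column names are passed in; every other column is picked up from the frames themselves, which is the generic behaviour the question asks for.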

A more convoluted approach using indexes:

    df3 = (df2.set_index(cols)
              .combine_first(df1.set_index(cols))
              .reindex(df1[cols]).reset_index()
          )

Output:

        Name Gender Age   LastLogin LastPurchase
    0    Bob   Male  21  2022-12-01   2022-12-01
    1  Frank   Male  22  2023-02-01   2023-02-01
    2  Steve   Male  23  2022-11-01   2022-11-02
    3   John   Male  24  2023-04-01   2023-04-01
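
A possible alternative that the answer does not mention is `DataFrame.update`, which overwrites values in place using the non-NA values of another frame aligned on the index. The sketch below assumes that each key combination is unique within each frame; the variable names `keys` and `updated` are chosen here for illustration.

```python
import pandas as pd

columns = ["Name", "Gender", "Age", "LastLogin", "LastPurchase"]
df1 = pd.DataFrame(
    [
        ["Bob", "Male", "21", "2023-01-01", "2023-01-01"],
        ["Frank", "Male", "22", "2023-02-01", "2023-02-01"],
        ["Steve", "Male", "23", "2023-03-01", "2023-03-01"],
        ["John", "Male", "24", "2023-04-01", "2023-04-01"],
    ],
    columns=columns,
)
df2 = pd.DataFrame(
    [
        ["Steve", "Male", "23", "2022-11-01", "2022-11-02"],
        ["Simon", "Male", "23", "2023-03-01", "2023-03-02"],
        ["Gary", "Male", "24", "2023-04-01", "2023-04-02"],
        ["Bob", "Male", "21", "2022-12-01", "2022-12-01"],
    ],
    columns=columns,
)

keys = ["Name", "Gender", "Age"]

# set_index() returns a new frame, so df1 itself is left untouched.
updated = df1.set_index(keys)
# update() works in place and returns None: rows of df2 whose keys are absent
# from df1 (Simon, Gary) are ignored, matching rows overwrite the shared columns.
updated.update(df2.set_index(keys))
# Turn the keys back into ordinary columns; the row order of df1 is preserved.
updated = updated.reset_index()
print(updated)
```

Because `update` skips NA values in the other frame, a missing value in df2 leaves the corresponding df1 value in place, which is the same behaviour as the `fillna` approach.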

huangapple
  • Posted on 2023-02-23 21:11:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75545314.html