2023-02-23 21:11:38

Python Pandas DataFrame Merge on Columns with Overwrite

Question

Is there a way to merge two Pandas DataFrames, by matching on (and retaining) supplied columns, but overwriting all the rest?

For example:

import pandas as pd
df1 = pd.DataFrame(columns=["Name", "Gender", "Age", "LastLogin", "LastPurchase"])
df1.loc[0] = ["Bob", "Male", "21", "2023-01-01", "2023-01-01"]
df1.loc[1] = ["Frank", "Male", "22", "2023-02-01", "2023-02-01"]
df1.loc[2] = ["Steve", "Male", "23", "2023-03-01", "2023-03-01"]
df1.loc[3] = ["John", "Male", "24", "2023-04-01", "2023-04-01"]
df2 = pd.DataFrame(columns=["Name", "Gender", "Age", "LastLogin", "LastPurchase"])
df2.loc[0] = ["Steve", "Male", "23", "2022-11-01", "2022-11-02"]
df2.loc[1] = ["Simon", "Male", "23", "2023-03-01", "2023-03-02"]
df2.loc[2] = ["Gary", "Male", "24", "2023-04-01", "2023-04-02"]
df2.loc[3] = ["Bob", "Male", "21", "2022-12-01", "2022-12-01"]
>>> df1
    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2023-01-01   2023-01-01
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2023-03-01   2023-03-01
3   John   Male  24  2023-04-01   2023-04-01
>>> df2
    Name Gender Age   LastLogin LastPurchase
0  Steve   Male  23  2022-11-01   2022-11-02
1  Simon   Male  23  2023-03-01   2023-03-02
2   Gary   Male  24  2023-04-01   2023-04-02
3    Bob   Male  21  2022-12-01   2022-12-01

What I'd like to end up with is df1 updated with values from df2 wherever the "Name", "Gender" and "Age" columns match, regardless of what the other columns contain, so I'd end up with this:

>>> df1
    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2022-12-01   2022-12-01 # Updated last two columns from df2
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2022-11-01   2022-11-02 # Updated last two columns from df2
3   John   Male  24  2023-04-01   2023-04-01

I can do a merge like this:

>>> df3 = df1.merge(df2, on=["Name", "Gender", "Age"], how='left')

But then I have to manually extract data from and drop the new columns created from the merge, using their names:

>>> df3['LastLogin'] = df3['LastLogin_y'].fillna(df3['LastLogin_x'])
>>> df3['LastPurchase'] = df3['LastPurchase_y'].fillna(df3['LastPurchase_x'])
>>> df3.drop(['LastLogin_x', 'LastLogin_y'], axis=1, inplace=True)
>>> df3.drop(['LastPurchase_x', 'LastPurchase_y'], axis=1, inplace=True)
>>>
>>> df3
    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2022-12-01   2022-12-01
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2022-11-01   2022-11-02
3   John   Male  24  2023-04-01   2023-04-01

I'm trying to avoid this, as I need a generic way to update batches of data, and I don't know all their column names (just the ones I want to match on).

Answer 1

Score: 2

You can avoid the _x/_y columns by selecting only the merge keys from df1 before merging, then filling the gaps from the original with fillna (or combine_first):

cols = ["Name", "Gender", "Age"]
df3 = df1[cols].merge(df2, how='left').fillna(df1)
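
If you need this for arbitrary batches, the one-liner above can be wrapped in a small helper. This is only a sketch: the name overwrite_matching is made up for illustration, and it assumes that both frames share the same columns, that each key combination appears at most once in the second frame, and that the first frame's index has no duplicates.

```python
import pandas as pd


def overwrite_matching(left, right, keys):
    """Hypothetical helper: return `left` with its non-key columns
    overwritten by `right` wherever the `keys` columns match.

    Assumptions: same columns in both frames, `keys` unique in `right`,
    no duplicate labels in `left.index`.
    """
    if right.duplicated(subset=keys).any():
        raise ValueError("key combinations must be unique in `right`")
    # Merge only the key columns of `left`, so no _x/_y suffixes appear.
    merged = left[keys].merge(right, on=keys, how="left")
    # A left merge on unique keys keeps row order and count, so the
    # original labels can be restored for row-wise alignment.
    merged.index = left.index
    # Rows without a match are NaN here; take those values from `left`.
    return merged.fillna(left)


columns = ["Name", "Gender", "Age", "LastLogin", "LastPurchase"]
df1 = pd.DataFrame(
    [
        ["Bob", "Male", "21", "2023-01-01", "2023-01-01"],
        ["Frank", "Male", "22", "2023-02-01", "2023-02-01"],
        ["Steve", "Male", "23", "2023-03-01", "2023-03-01"],
        ["John", "Male", "24", "2023-04-01", "2023-04-01"],
    ],
    columns=columns,
)
df2 = pd.DataFrame(
    [
        ["Steve", "Male", "23", "2022-11-01", "2022-11-02"],
        ["Simon", "Male", "23", "2023-03-01", "2023-03-02"],
        ["Gary", "Male", "24", "2023-04-01", "2023-04-02"],
        ["Bob", "Male", "21", "2022-12-01", "2022-12-01"],
    ],
    columns=columns,
)

result = overwrite_matching(df1, df2, ["Name", "Gender", "Age"])
print(result)
```

The duplicate check matters because a left merge against repeated keys would return more rows than df1 has, and the fill step would no longer line up row by row.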

A more convoluted approach using indexes:

df3 = (df2.set_index(cols)
          .combine_first(df1.set_index(cols))
          .reindex(df1[cols]).reset_index()
       )

Output:

    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2022-12-01   2022-12-01
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2022-11-01   2022-11-02
3   John   Male  24  2023-04-01   2023-04-01
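
Another option, not part of the answer above, might be DataFrame.update, which overwrites values in place from another frame aligned on the index. The following is a minimal sketch, assuming each key combination is unique in both frames; the variable names keys, indexed and updated are only illustrative.

```python
import pandas as pd

keys = ["Name", "Gender", "Age"]
columns = keys + ["LastLogin", "LastPurchase"]
df1 = pd.DataFrame(
    [
        ["Bob", "Male", "21", "2023-01-01", "2023-01-01"],
        ["Frank", "Male", "22", "2023-02-01", "2023-02-01"],
        ["Steve", "Male", "23", "2023-03-01", "2023-03-01"],
        ["John", "Male", "24", "2023-04-01", "2023-04-01"],
    ],
    columns=columns,
)
df2 = pd.DataFrame(
    [
        ["Steve", "Male", "23", "2022-11-01", "2022-11-02"],
        ["Simon", "Male", "23", "2023-03-01", "2023-03-02"],
        ["Gary", "Male", "24", "2023-04-01", "2023-04-02"],
        ["Bob", "Male", "21", "2022-12-01", "2022-12-01"],
    ],
    columns=columns,
)

# Move the match columns into the index so rows align on them.
# set_index returns a new frame, so df1 itself is left untouched.
indexed = df1.set_index(keys)
# update() works in place: matching rows take the non-NA values of df2,
# rows that exist only in df2 (Simon, Gary) are ignored.
indexed.update(df2.set_index(keys))
# Turn the match columns back into ordinary columns.
updated = indexed.reset_index()
print(updated)
```

Note that update only copies non-NA values, so a missing value in df2 would not blank out an existing value in df1, and columns that exist only in df2 are not added.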
