英文:
pandas: sequential merge adds new columns instead of replacing NaN values
问题
你可以使用suffixes
参数来指定合并时重名列的后缀。这样,你可以避免生成*_x和*_y这样的重复列名。以下是修改后的代码:
df_all = df_all.merge(df_additional_info1, how="left", on="uid", suffixes=("", "_additional1"))
df_all = df_all.merge(df_additional_info2, how="outer", on="uid", suffixes=("", "_additional2"))
这将给你所需的结果,使用空字符串作为后缀来避免生成重复的列名。
英文:
import pandas as pd
df_all = pd.DataFrame(columns=["uid", "a", "b"], data=[["uid1", 12, 15],
["uid2", 13, 16],
["uid3", 14, 17],
["uid4", 15, 18]])
df_additional_info1 = pd.DataFrame(columns=["uid", "c", "d"], data=[["uid1", 12, 15],
["uid3", 14, 17]])
df_additional_info2 = pd.DataFrame(columns=["uid", "c", "d"], data=[["uid2", 12, 15]])
I need to merge df_all twice with additional information. First with df_additional_info1, then with df_additional_info2 and so on. They will always contain additional info for already existing rows and only for those rows that were not yet updated.
When I do following:
df_all = df_all.merge(df_additional_info1, how="left", on="uid")
df_all = df_all.merge(df_additional_info2, how="outer", on="uid")
I get duplicated columns (*_x, *_y):
Any suggestion?
答案1
得分: 1
另一种可能的解决方案:
(pd.concat([df_all.set_index('uid'), df_additional_info1.set_index('uid'),
df_additional_info2.set_index('uid')])
.stack().unstack().reset_index())
或者,
(df_additional_info2.set_index('uid')
.combine_first(
pd.concat([df_all.set_index('uid'),
df_additional_info1.set_index('uid')], axis=1))
.reset_index())
解释:
该代码首先将 uid
设置为所有数据帧的索引,便于合并操作。它使用 pd.concat()
沿着列轴连接 df_all
和 df_additional_info1
,将 df_additional_info1
的列附加到 df_all
,创建一个包含两者所有列的数据帧,对于 df_additional_info1
中缺少的 uid
,用 NaN
填充。然后,combine_first()
用连接后的数据帧中的值替换 df_additional_info2
中的任何空值,实质上是使用 df_all
和 df_additional_info1
的数据填充 df_additional_info2
中的空白,基于 uid
。最后一步,reset_index()
将 uid
移回普通列并重新建立默认整数索引,生成一个合并的数据帧,包含所有可用的附加信息。
输出:
uid a b c d
0 uid1 12 15 12.0 15.0
1 uid2 13 16 12.0 15.0
2 uid3 14 17 14.0 17.0
3 uid4 15 18 NaN NaN
英文:
Another possible solution:
(pd.concat([df_all.set_index('uid'), df_additional_info1.set_index('uid'),
df_additional_info2.set_index('uid')])
.stack().unstack().reset_index())
Alternatively,
(df_additional_info2.set_index('uid')
.combine_first(
pd.concat([df_all.set_index('uid'),
df_additional_info1.set_index('uid')], axis=1))
.reset_index())
EXPLANATION:
The code first sets uid
as the index for all DataFrames, facilitating the merging operations. It concatenates df_all
and df_additional_info1
along the column axis using pd.concat()
, appending df_additional_info1
's columns to df_all
and creating a DataFrame containing all columns from both, filling with NaN
for uids
absent in df_additional_info1
. combine_first()
then replaces any null values in df_additional_info2
with values from the concatenated DataFrame, essentially filling in df_additional_info2
's gaps with data from df_all
and df_additional_info1
, based on uid
. The final step, reset_index()
, moves uid
back to a regular column and reestablishes the default integer index, producing a merged DataFrame with all available additional information.
Output:
uid a b c d
0 uid1 12 15 12.0 15.0
1 uid2 13 16 12.0 15.0
2 uid3 14 17 14.0 17.0
3 uid4 15 18 NaN NaN
答案2
得分: 0
你可以使用.concat
连接额外的信息帧:
df_all.merge(
pd.concat([df_additional_info1, df_additional_info2]),
how="left"
)
uid a b c d
0 uid1 12 15 12.0 15.0
1 uid2 13 16 12.0 15.0
2 uid3 14 17 14.0 17.0
3 uid4 15 18 NaN NaN
英文:
You can .concat
the additional info frames:
df_all.merge(
pd.concat([df_additional_info1, df_additional_info2]),
how = "left"
)
uid a b c d
0 uid1 12 15 12.0 15.0
1 uid2 13 16 12.0 15.0
2 uid3 14 17 14.0 17.0
3 uid4 15 18 NaN NaN
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论