Merging two DataFrames with multiple rows for the same key
Question
I have medical data split into two different CSVs, and I need to merge them. One data set contains basic demographic information, and the second contains diagnosis codes. Each patient is assigned a unique identification number called INC_KEY, which I've simplified to simple numbers, as shown in this example:
df1:
INC_KEY SEX AGE
1 F 40
2 F 24
3 M 66
df2:
INC_KEY DCODE
1 BW241ZZ
1 BW28ZZZ
2 0BH17EZ
3 05H633Z
2 4A103BD
3 BR30ZZZ
1 BF42ZZZ
I need to merge the two dataframes so that the output contains the three rows seen in df1, with one appended column for each DCODE belonging to that patient. Like this:
INC_KEY SEX AGE DCODE1 DCODE2 DCODE3
1 F 40 BW241ZZ BW28ZZZ BF42ZZZ
2 F 24 0BH17EZ 4A103BD N/A
3 M 66 05H633Z BR30ZZZ N/A
How can I go about this? I've tried to do a left merge but it does not give the result I am looking for.
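For anyone reproducing the problem, here is a minimal sketch that builds the two example frames from the tables above and shows why a plain left merge falls short. The variable name long_df is only illustrative.

```python
import pandas as pd

# Sample frames copied from the tables above.
df1 = pd.DataFrame({
    "INC_KEY": [1, 2, 3],
    "SEX": ["F", "F", "M"],
    "AGE": [40, 24, 66],
})
df2 = pd.DataFrame({
    "INC_KEY": [1, 1, 2, 3, 2, 3, 1],
    "DCODE": ["BW241ZZ", "BW28ZZZ", "0BH17EZ", "05H633Z",
              "4A103BD", "BR30ZZZ", "BF42ZZZ"],
})

# A plain left merge repeats each patient once per diagnosis code,
# so the result is long (one row per patient/code pair) rather than
# wide (one row per patient).
long_df = df1.merge(df2, on="INC_KEY", how="left")
print(long_df.shape)
```

The merge itself is correct; what is missing is a reshaping step that spreads each patient's codes across columns.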
Answer 1
Score: 1
You can combine the two dataframes on the INC_KEY column using .merge(). Then, you can use .groupby() and pd.concat() to turn the individual rows into the desired columns. Finally, you can drop the original "DCODE" column using .drop():
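import pandas as pd

# One row per (patient, code) pair; how="right" keeps every row of df2.
df = df1.merge(df2, on="INC_KEY", how="right")
# Collect each patient's codes into a single list.
df = df.groupby(["INC_KEY", "SEX", "AGE"]).agg({"DCODE": list}).reset_index()
# Expand the lists into columns DCODE0, DCODE1, ... and append them.
df = pd.concat(
    (df, pd.DataFrame(df["DCODE"].values.tolist()).add_prefix("DCODE")),
    axis=1
)
# Drop the intermediate list column.
df = df.drop("DCODE", axis=1)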
This outputs:
INC_KEY SEX AGE DCODE0 DCODE1 DCODE2
0 1 F 40 BW241ZZ BW28ZZZ BF42ZZZ
1 2 F 24 0BH17EZ 4A103BD None
2 3 M 66 05H633Z BR30ZZZ None
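A possible variant of the same idea, as a sketch rather than a definitive fix: widen df2 on its own first, then left-join it onto df1. This keeps patients who have no diagnosis codes and gives the 1-based column names the question asks for. Patient 4 below is hypothetical and was added only to show the no-code case. The names codes, wide and result are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "INC_KEY": [1, 2, 3, 4],   # patient 4 is hypothetical (no codes in df2)
    "SEX": ["F", "F", "M", "M"],
    "AGE": [40, 24, 66, 51],
})
df2 = pd.DataFrame({
    "INC_KEY": [1, 1, 2, 3, 2, 3, 1],
    "DCODE": ["BW241ZZ", "BW28ZZZ", "0BH17EZ", "05H633Z",
              "4A103BD", "BR30ZZZ", "BF42ZZZ"],
})

# Collect each patient's codes into a list, in order of appearance.
codes = df2.groupby("INC_KEY")["DCODE"].agg(list)

# Expand the lists into columns; shorter lists are padded with missing values.
wide = pd.DataFrame(codes.tolist(), index=codes.index)
wide.columns = [f"DCODE{i + 1}" for i in wide.columns]

# A left join keeps every row of df1, including patients with no codes.
result = df1.merge(wide.reset_index(), on="INC_KEY", how="left")
print(result)
```

Grouping only df2 also avoids using SEX and AGE as group keys, so a missing demographic value would not cause a patient to be dropped by the groupby.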
Answer 2
Score: 0
Here's another way:
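# Inner merge: one row per (patient, code) pair.
df_out = df1.merge(df2, on='INC_KEY')
# Number each patient's codes with cumcount(), then unstack that counter into columns.
df_out = df_out.set_index(['INC_KEY', 'SEX', 'AGE', df_out.groupby('INC_KEY').cumcount()]).unstack()
# Flatten the ('DCODE', n) column pairs into 'DCODEn'.
df_out.columns = [f'{i}{j}' for i, j in df_out.columns]
df_out = df_out.reset_index()
print(df_out)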
Output:
INC_KEY SEX AGE DCODE0 DCODE1 DCODE2
0 1 F 40 BW241ZZ BW28ZZZ BF42ZZZ
1 2 F 24 0BH17EZ 4A103BD NaN
2 3 M 66 05H633Z BR30ZZZ NaN
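The same counter-based reshaping can also be written with pivot, which may be easier to read. The following is a minimal sketch using the question's sample data; the helper column n and the names numbered, wide and result are assumptions made for illustration. Starting the counter at 1 yields DCODE1, DCODE2, DCODE3 as in the requested output.

```python
import pandas as pd

df1 = pd.DataFrame({
    "INC_KEY": [1, 2, 3],
    "SEX": ["F", "F", "M"],
    "AGE": [40, 24, 66],
})
df2 = pd.DataFrame({
    "INC_KEY": [1, 1, 2, 3, 2, 3, 1],
    "DCODE": ["BW241ZZ", "BW28ZZZ", "0BH17EZ", "05H633Z",
              "4A103BD", "BR30ZZZ", "BF42ZZZ"],
})

# Number each patient's codes 1, 2, 3, ... in order of appearance.
numbered = df2.assign(n=df2.groupby("INC_KEY").cumcount() + 1)

# Pivot long to wide: one row per patient, one column per code position.
wide = numbered.pivot(index="INC_KEY", columns="n", values="DCODE")
wide.columns = [f"DCODE{n}" for n in wide.columns]

# Attach the wide codes to the demographic rows.
result = df1.merge(wide.reset_index(), on="INC_KEY", how="left")
print(result)
```

Missing positions come out as NaN here, as in the output above.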