Merging two DataFrames with multiple rows for the same key

Question

I have medical data split into two different CSVs, and I need to merge them. One data set contains basic demographic information, and the second contains diagnosis codes. Each patient is assigned a unique identification number called INC_KEY, which I've simplified to simple numbers, as shown in this example:

df1:

    INC_KEY  SEX  AGE
    1        F    40
    2        F    24
    3        M    66

df2:

    INC_KEY  DCODE
    1        BW241ZZ
    1        BW28ZZZ
    2        0BH17EZ
    3        05H633Z
    2        4A103BD
    3        BR30ZZZ
    1        BF42ZZZ

I need to merge the two DataFrames so that the output keeps the three rows from df1 and adds one column for each DCODE belonging to that patient, like this:

    INC_KEY  SEX  AGE  DCODE1   DCODE2   DCODE3
    1        F    40   BW241ZZ  BW28ZZZ  BF42ZZZ
    2        F    24   0BH17EZ  4A103BD  N/A
    3        M    66   05H633Z  BR30ZZZ  N/A

How can I go about this? I've tried to do a left merge but it does not give the result I am looking for.
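To make the problem concrete, here is a minimal sketch that rebuilds the two example frames (the data is copied from the tables above) and applies a plain left merge. It suggests why the merge alone is not enough: the result has one row per diagnosis code, so the patient rows are repeated and the codes still have to be spread into columns.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "INC_KEY": [1, 2, 3],
    "SEX": ["F", "F", "M"],
    "AGE": [40, 24, 66],
})
df2 = pd.DataFrame({
    "INC_KEY": [1, 1, 2, 3, 2, 3, 1],
    "DCODE": ["BW241ZZ", "BW28ZZZ", "0BH17EZ", "05H633Z",
              "4A103BD", "BR30ZZZ", "BF42ZZZ"],
})

# A left merge repeats each patient's demographics once per matching code.
merged = df1.merge(df2, on="INC_KEY", how="left")
print(merged.shape)  # (7, 4): seven rows, not the three that are wanted
```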

Answer 1

Score: 1

You can combine the two dataframes on the INC_KEY column using .merge. Then, you can use .groupby() and pd.concat() to turn individual rows into the desired columns. Finally, you can drop the original "DCODE" column using .drop():

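    import pandas as pd

    # Attach demographics to every diagnosis row (one row per code).
    df = df1.merge(df2, on="INC_KEY", how="right")
    # Gather each patient's codes into a single list.
    df = df.groupby(["INC_KEY", "SEX", "AGE"]).agg({"DCODE": list}).reset_index()
    # Expand the lists into DCODE0, DCODE1, ... columns next to the grouped frame.
    df = pd.concat(
        (df, pd.DataFrame(df["DCODE"].values.tolist()).add_prefix("DCODE")),
        axis=1,
    )
    # Remove the intermediate list column.
    df = df.drop("DCODE", axis=1)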

This outputs:

       INC_KEY SEX  AGE   DCODE0   DCODE1   DCODE2
    0        1   F   40  BW241ZZ  BW28ZZZ  BF42ZZZ
    1        2   F   24  0BH17EZ  4A103BD     None
    2        3   M   66  05H633Z  BR30ZZZ     None
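The output above numbers the columns from DCODE0, while the question asked for DCODE1 onward. The following is a sketch of one possible variant, not the answerer's code: it widens df2 on its own, renames the columns to start at 1, and then left-joins the result onto df1 so that every patient in df1 keeps exactly one row, even one with no codes. The intermediate names `codes`, `wide`, and `result` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "INC_KEY": [1, 2, 3],
    "SEX": ["F", "F", "M"],
    "AGE": [40, 24, 66],
})
df2 = pd.DataFrame({
    "INC_KEY": [1, 1, 2, 3, 2, 3, 1],
    "DCODE": ["BW241ZZ", "BW28ZZZ", "0BH17EZ", "05H633Z",
              "4A103BD", "BR30ZZZ", "BF42ZZZ"],
})

# Collect each patient's codes into a list, in order of appearance.
codes = df2.groupby("INC_KEY")["DCODE"].agg(list)

# Expand the lists into columns; shorter lists are padded with missing values.
wide = pd.DataFrame(codes.tolist(), index=codes.index)
wide.columns = [f"DCODE{i + 1}" for i in range(wide.shape[1])]

# join() matches df1's INC_KEY column against wide's index (left join by default).
result = df1.join(wide, on="INC_KEY")
print(result)
```

Because the join starts from df1, a patient with no rows in df2 would still appear, with every DCODE column empty.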

Answer 2

Score: 0

Here's another way:

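    # Inner-join demographics onto every diagnosis row.
    df_out = df1.merge(df2, on='INC_KEY')
    # Number each patient's codes with cumcount(), then unstack that counter into columns.
    df_out = df_out.set_index(['INC_KEY', 'SEX', 'AGE', df_out.groupby('INC_KEY').cumcount()]).unstack()
    # Flatten the ('DCODE', n) column MultiIndex into 'DCODEn'.
    df_out.columns = [f'{i}{j}' for i, j in df_out.columns]
    df_out = df_out.reset_index()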

Output:

       INC_KEY SEX  AGE   DCODE0   DCODE1   DCODE2
    0        1   F   40  BW241ZZ  BW28ZZZ  BF42ZZZ
    1        2   F   24  0BH17EZ  4A103BD      NaN
    2        3   M   66  05H633Z  BR30ZZZ      NaN
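The same cumcount idea can also be written with `DataFrame.pivot`, which may read more directly than `set_index(...).unstack()`. This is a sketch under a few assumptions rather than the answerer's method: the helper column `n` and the names `numbered`, `wide`, and `result` are made up for illustration, and the counter is shifted by one so the columns start at DCODE1 as the question requested.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "INC_KEY": [1, 2, 3],
    "SEX": ["F", "F", "M"],
    "AGE": [40, 24, 66],
})
df2 = pd.DataFrame({
    "INC_KEY": [1, 1, 2, 3, 2, 3, 1],
    "DCODE": ["BW241ZZ", "BW28ZZZ", "0BH17EZ", "05H633Z",
              "4A103BD", "BR30ZZZ", "BF42ZZZ"],
})

# Number each patient's codes 1, 2, 3, ... in order of appearance.
numbered = df2.assign(n=df2.groupby("INC_KEY").cumcount() + 1)

# One row per patient, one column per code position; gaps become NaN.
wide = numbered.pivot(index="INC_KEY", columns="n", values="DCODE")
wide.columns = [f"DCODE{n}" for n in wide.columns]

# Left-join onto df1 so each patient keeps a single row.
result = df1.join(wide, on="INC_KEY")
print(result)
```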

huangapple
  • Published on January 9, 2023, 08:37:55
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75052293.html