如何在pandas中根据任何列匹配来合并数据框?

huangapple go评论105阅读模式
英文:

How to merge dataframes by if ANY of the columns matches in pandas?

问题

以下是您要翻译的内容:

"I have two dataframes with similar data. I want to merge them to combine all the information into one dataframe. The problem is, I would like to prioritize data from one dataframe if there are conflicts between merge of multiple columns (df1 in the example). And also I want to do it on multiple columns if ANY of the chosen columns match.

I apologize if my explanation is not clear enough. If there is any other information I should provide please let me know.

The way I do it now. This works fine if I would choose only one column but I can't figure out how to do it on multiple.

  1. merge_by = ['id', 'name1', 'name2']
  2. a = df1.merge(df2, how='outer', on=merge_by)
  3. So how I would imagine this to work is
  4. df1.merge(df2, how='outer', on='id' or 'name1' or 'name2')
  1. df1= DataFrame([
  2. [0, 'john', 'bon', 'ron'],
  3. [1, 'alex', 'dale', 'bruce'],
  4. [2, 'joey', 'bill', 'maci'],
  5. [3, 'choi', 'nath', 'karl'],
  6. [4, 'walt', '', 'xander'],
  7. ], columns=['id','name1','name2','name3'])
  8. id name1 name2 name3
  9. 0 'john' 'bon' 'ron'
  10. 1 'alex' 'dale' 'bruce'
  11. 2 'joey' 'bill' 'maci'
  12. 3 'choi' 'nath' 'karl'
  13. 4 'walt' '' 'xander'
  1. df2= DataFrame([
  2. [0, 'emil', 'tia', 'bia'],
  3. [4, '', 'sara', 'carmen'],
  4. [5, 'aden', 'dale', 'leia'],
  5. [6, 'joey', 'jax', 'jace'],
  6. [7, 'choi', 'nath', 'andre'],
  7. [8, '', '', 'piper'],
  8. ], columns=['id','name1','name2','name3'])
  9. id name1 name2 name3
  10. 0 'emil' 'tia' 'bia'
  11. 4 '' 'sara' 'carmen'
  12. 5 'aden' 'dale' 'leia'
  13. 6 'joey' 'jax' 'jace'
  14. 7 'choi' 'nath' 'andre'
  15. 8 '' '' 'piper'

所需的输出是:

  1. id name1 name2 name3_x name3_y
  2. 0 'john' 'bon' 'ron' 'bia'
  3. 1 'alex' 'dale' 'bruce' 'leia'
  4. 2 'joey' 'bill' 'maci' 'jace'
  5. 3 'choi' 'nath' 'karl' 'andre'
  6. 4 'walt' 'sara' 'xander' 'carmen'
  7. 8 '' '' '' 'piper'

希望这对您有所帮助。

英文:

I have two dataframes with similar data. I want to merge them to combine all the information into one dataframe. The problem is, I would like to prioritize data from one dataframe if there are conflicts between merge of multiple columns (df1 in the example). And also I want to do it on multiple columns if ANY of the chosen columns match.

I apologize if my explanation is not clear enough. If there is any other information I should provide please let me know.

The way I do it now. This works fine if I would choose only one column but I can't figure out how to do it on multiple.

  1. merge_by = ['id', 'name1', 'name2']
  2. a = df1.merge(df2, how='outer', on=merge_by)
  3. So how I would imagine this to work is
  4. df1.merge(df2, how='outer', on='id' or 'name1' or 'name2')
  1. df1= DataFrame([
  2. [0, 'john', 'bon', 'ron'],
  3. [1, 'alex', 'dale', 'bruce'],
  4. [2, 'joey', 'bill', 'maci'],
  5. [3, 'choi', 'nath', 'karl'],
  6. [4, 'walt', '', 'xander'],
  7. ], columns=['id','name1','name2','name3'])
  8. id name1 name2 name3
  9. 0 'john' 'bon' 'ron'
  10. 1 'alex' 'dale' 'bruce'
  11. 2 'joey' 'bill' 'maci'
  12. 3 'choi' 'nath' 'karl'
  13. 4 'walt' '' 'xander'
  1. df2= DataFrame([
  2. [0, 'emil', 'tia', 'bia'],
  3. [4, '', 'sara', 'carmen'],
  4. [5, 'aden', 'dale', 'leia'],
  5. [6, 'joey', 'jax', 'jace'],
  6. [7, 'choi', 'nath', 'andre'],
  7. [8, '', '', 'piper'],
  8. ], columns=['id','name1','name2','name3'])
  9. id name1 name2 name3
  10. 0 'emil' 'tia' 'bia'
  11. 4 '' 'sara' 'carmen'
  12. 5 'aden' 'dale' 'leia'
  13. 6 'joey' 'jax' 'jace'
  14. 7 'choi' 'nath' 'andre'
  15. 8 '' '' 'piper'

The output I would want

  1. id name1 name2 name3_x name3_y
  2. 0 'john' 'bon' 'ron' 'bia'
  3. 1 'alex' 'dale' 'bruce' 'leia'
  4. 2 'joey' 'bill' 'maci' 'jace'
  5. 3 'choi' 'nath' 'karl' 'andre'
  6. 4 'walt' 'sara' 'xander' 'carmen'
  7. 8 '' '' '' 'piper'

Edit** Code taken from the answer here as suggested in the comments below.

  1. df1= pd.DataFrame([
  2. [0, 'john', 'bon', 'ron'],
  3. [1, 'alex', 'dale', 'bruce'],
  4. [2, 'joey', 'bill', 'maci'],
  5. [3, 'choi', 'nath', 'karl'],
  6. [4, 'walt', '', 'xander'],
  7. ], columns=['id','name1','name2','name3'])
  8. df2= pd.DataFrame([
  9. [0, 'emil', 'tia', 'bia'],
  10. [4, '', 'sara', 'carmen'],
  11. [5, 'aden', 'dale', 'leia'],
  12. [6, 'joey', 'jax', 'jace'],
  13. [7, 'choi', 'nath', 'andre'],
  14. [8, '', '', 'piper'],
  15. ], columns=['id','name1','name2','name3'])
  16. suff_A = ['_on_A_match_1', '_on_A_match_2']
  17. suff_B = ['_on_B_match_1', '_on_B_match_2']
  18. suff_C = ['_on_C_match_1', '_on_C_match_2']
  19. df = pd.concat([df1.merge(df2[df2['id'] != ''], on='id', suffixes=suff_A),
  20. df1.merge(df2[df2['name1'] != ''], on='name1', suffixes=suff_B),
  21. df1.merge(df2[df2['name2'] != ''], on='name2', suffixes=suff_C)])
  22. dups = (df.id_on_B_match_1 == df.id_on_B_match_2) # also could remove A_on_B_match
  23. a = df.loc[~dups]
  24. print(df)

如何在pandas中根据任何列匹配来合并数据框?

The problem with this one is that the id 3 is repeated, I am not sure how to set up dups with more than 2 columns. And also how could I format the final output to be only the answers that I want?

答案1

得分: 1

以下是您要翻译的代码部分:

  1. columns = "id", "name1", "name2"
  2. df = pd.concat(
  3. df1.merge(df2.dropna(subset=column), on=column, suffixes=["", "_y"])
  4. for column in columns
  5. ).drop_duplicates("id")
  6. ids = set(df["id"].dropna()).union(df["id_y"].dropna())
  7. pd.concat([
  8. df,
  9. df1[~df1["id"].isin(ids)],
  10. df2[~df2["id"].isin(ids)]
  11. ])
  1. id name1 name2 name3 name1_y name2_y name3_y id_y
  2. 0 0 john bon ron emil tia bia NaN
  3. 1 4 walt NaN xander NaN sara carmen NaN
  4. 0 2 joey bill maci NaN jax jace 6.0
  5. 1 3 choi nath karl NaN nath andre 7.0
  6. 0 1 alex dale bruce aden NaN leia 5.0
  7. 5 8 NaN NaN piper NaN NaN NaN NaN

希望这对您有所帮助。如果您需要任何进一步的帮助,请随时告诉我。

英文:
  1. columns = "id", "name1", "name2"
  2. df = pd.concat(
  3. df1.merge(df2.dropna(subset=column), on=column, suffixes=["", "_y"])
  4. for column in columns
  5. ).drop_duplicates("id")
  6. ids = set(df["id"].dropna()).union(df["id_y"].dropna())
  7. pd.concat([
  8. df,
  9. df1[~df1["id"].isin(ids)],
  10. df2[~df2["id"].isin(ids)]
  11. ])
  1. id name1 name2 name3 name1_y name2_y name3_y id_y
  2. 0 0 john bon ron emil tia bia NaN
  3. 1 4 walt NaN xander NaN sara carmen NaN
  4. 0 2 joey bill maci NaN jax jace 6.0
  5. 1 3 choi nath karl NaN nath andre 7.0
  6. 0 1 alex dale bruce aden NaN leia 5.0
  7. 5 8 NaN NaN piper NaN NaN NaN NaN

huangapple
  • 本文由 发表于 2023年5月11日 11:53:29
  • 转载请务必保留本文链接:https://go.coder-hub.com/76224043.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定