如何在pandas中根据任何列匹配来合并数据框?

huangapple go评论64阅读模式
英文:

How to merge dataframes by if ANY of the columns matches in pandas?

问题

以下是您要翻译的内容:

"I have two dataframes with similar data. I want to merge them to combine all the information into one dataframe. The problem is, I would like to prioritize data from one dataframe if there are conflicts between merge of multiple columns (df1 in the example). And also I want to do it on multiple columns if ANY of the chosen columns match.

I apologize if my explanation is not clear enough. If there is any other information I should provide please let me know.

The way I do it now. This works fine if I would choose only one column but I can't figure out how to do it on multiple.

merge_by = ['id', 'name1', 'name2']
a = df1.merge(df2, how='outer', on=merge_by)

So how I would imagine this to work is
df1.merge(df2, how='outer', on='id' or 'name1' or 'name2')
df1= DataFrame([
    [0, 'john', 'bon', 'ron'],
    [1, 'alex', 'dale', 'bruce'],
    [2, 'joey', 'bill', 'maci'],
    [3, 'choi', 'nath', 'karl'],
    [4, 'walt', '', 'xander'],
], columns=['id','name1','name2','name3'])


id   name1   name2   name3
0    'john'   'bon'   'ron'
1    'alex'   'dale'  'bruce'
2    'joey'   'bill'  'maci'
3    'choi'   'nath'  'karl'
4    'walt'   ''      'xander'
df2= DataFrame([
    [0, 'emil', 'tia', 'bia'],
    [4, '', 'sara', 'carmen'],
    [5, 'aden', 'dale', 'leia'],
    [6, 'joey', 'jax', 'jace'],
    [7, 'choi', 'nath', 'andre'],
    [8, '', '', 'piper'],
], columns=['id','name1','name2','name3'])


id   name1   name2   name3
0    'emil'   'tia'   'bia'
4    ''       'sara'  'carmen'
5    'aden'   'dale'  'leia'
6    'joey'   'jax'   'jace'
7    'choi'   'nath'  'andre'
8    ''       ''      'piper'

所需的输出是:

id   name1   name2   name3_x name3_y
0    'john'   'bon'   'ron'   'bia'
1    'alex'   'dale'  'bruce' 'leia'
2    'joey'   'bill'  'maci'  'jace'
3    'choi'   'nath'  'karl'  'andre'
4    'walt'   'sara'  'xander' 'carmen'
8    ''       ''      ''      'piper'

希望这对您有所帮助。

英文:

I have two dataframes with similar data. I want to merge them to combine all the information into one dataframe. The problem is, I would like to prioritize data from one dataframe if there are conflicts between merge of multiple columns (df1 in the example). And also I want to do it on multiple columns if ANY of the chosen columns match.

I apologize if my explanation is not clear enough. If there is any other information I should provide please let me know.

The way I do it now. This works fine if I would choose only one column but I can't figure out how to do it on multiple.

merge_by = ['id', 'name1', 'name2']
a = df1.merge(df2, how='outer', on=merge_by)

So how I would imagine this to work is
df1.merge(df2, how='outer', on='id' or 'name1' or 'name2')
df1= DataFrame([
    [0, 'john', 'bon', 'ron'],
    [1, 'alex', 'dale', 'bruce'],
    [2, 'joey', 'bill', 'maci'],
    [3, 'choi', 'nath', 'karl'],
    [4, 'walt', '', 'xander'],
], columns=['id','name1','name2','name3'])


id   name1   name2   name3
0    'john'   'bon'   'ron'
1    'alex'   'dale'  'bruce'
2    'joey'   'bill'  'maci'
3    'choi'   'nath'  'karl'
4    'walt'   ''      'xander'
df2= DataFrame([
    [0, 'emil', 'tia', 'bia'],
    [4, '', 'sara', 'carmen'],
    [5, 'aden', 'dale', 'leia'],
    [6, 'joey', 'jax', 'jace'],
    [7, 'choi', 'nath', 'andre'],
    [8, '', '', 'piper'],
], columns=['id','name1','name2','name3'])


id   name1   name2   name3
0    'emil'   'tia'   'bia'
4    ''       'sara'  'carmen'
5    'aden'   'dale'  'leia'
6    'joey'   'jax'   'jace'
7    'choi'   'nath'  'andre'
8    ''       ''      'piper'

The output I would want

id   name1   name2   name3_x name3_y
0    'john'   'bon'   'ron'   'bia'
1    'alex'   'dale'  'bruce' 'leia'
2    'joey'   'bill'  'maci'  'jace'
3    'choi'   'nath'  'karl'  'andre'
4    'walt'   'sara'  'xander' 'carmen'
8    ''       ''      ''      'piper'

Edit** Code taken from the answer here as suggested in the comments below.

df1= pd.DataFrame([
    [0, 'john', 'bon', 'ron'],
    [1, 'alex', 'dale', 'bruce'],
    [2, 'joey', 'bill', 'maci'],
    [3, 'choi', 'nath', 'karl'],
    [4, 'walt', '', 'xander'],
], columns=['id','name1','name2','name3'])

df2= pd.DataFrame([
    [0, 'emil', 'tia', 'bia'],
    [4, '', 'sara', 'carmen'],
    [5, 'aden', 'dale', 'leia'],
    [6, 'joey', 'jax', 'jace'],
    [7, 'choi', 'nath', 'andre'],
    [8, '', '', 'piper'],
], columns=['id','name1','name2','name3'])

suff_A = ['_on_A_match_1', '_on_A_match_2']
suff_B = ['_on_B_match_1', '_on_B_match_2']
suff_C = ['_on_C_match_1', '_on_C_match_2']

df = pd.concat([df1.merge(df2[df2['id'] != ''], on='id', suffixes=suff_A), 
                df1.merge(df2[df2['name1'] != ''], on='name1', suffixes=suff_B),
                df1.merge(df2[df2['name2'] != ''], on='name2', suffixes=suff_C)])

dups = (df.id_on_B_match_1 == df.id_on_B_match_2) # also could remove A_on_B_match
a = df.loc[~dups]
print(df)

如何在pandas中根据任何列匹配来合并数据框?

The problem with this one is that the id 3 is repeated, I am not sure how to set up dups with more than 2 columns. And also how could I format the final output to be only the answers that I want?

答案1

得分: 1

以下是您要翻译的代码部分:

columns = "id", "name1", "name2"

df = pd.concat(
   df1.merge(df2.dropna(subset=column), on=column, suffixes=["", "_y"])
   for column in columns
).drop_duplicates("id")

ids = set(df["id"].dropna()).union(df["id_y"].dropna())

pd.concat([
   df,
   df1[~df1["id"].isin(ids)],
   df2[~df2["id"].isin(ids)]
])
   id name1 name2   name3 name1_y name2_y name3_y  id_y
0   0  john   bon     ron    emil     tia     bia   NaN
1   4  walt   NaN  xander     NaN    sara  carmen   NaN
0   2  joey  bill    maci     NaN     jax    jace   6.0
1   3  choi  nath    karl     NaN    nath   andre   7.0
0   1  alex  dale   bruce    aden     NaN    leia   5.0
5   8   NaN   NaN   piper     NaN     NaN     NaN   NaN

希望这对您有所帮助。如果您需要任何进一步的帮助,请随时告诉我。

英文:
columns = "id", "name1", "name2"

df = pd.concat(
   df1.merge(df2.dropna(subset=column), on=column, suffixes=["", "_y"])
   for column in columns
).drop_duplicates("id")

ids = set(df["id"].dropna()).union(df["id_y"].dropna())

pd.concat([
   df,
   df1[~df1["id"].isin(ids)],
   df2[~df2["id"].isin(ids)]
])
   id name1 name2   name3 name1_y name2_y name3_y  id_y
0   0  john   bon     ron    emil     tia     bia   NaN
1   4  walt   NaN  xander     NaN    sara  carmen   NaN
0   2  joey  bill    maci     NaN     jax    jace   6.0
1   3  choi  nath    karl     NaN    nath   andre   7.0
0   1  alex  dale   bruce    aden     NaN    leia   5.0
5   8   NaN   NaN   piper     NaN     NaN     NaN   NaN

huangapple
  • 本文由 发表于 2023年5月11日 11:53:29
  • 转载请务必保留本文链接:https://go.coder-hub.com/76224043.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定