英文:
How to merge dataframes by if ANY of the columns matches in pandas?
问题
以下是您要翻译的内容:
"I have two dataframes with similar data. I want to merge them to combine all the information into one dataframe. The problem is, I would like to prioritize data from one dataframe if there are conflicts between merge of multiple columns (df1 in the example). And also I want to do it on multiple columns if ANY of the chosen columns match.
I apologize if my explanation is not clear enough. If there is any other information I should provide please let me know.
The way I do it now. This works fine if I would choose only one column but I can't figure out how to do it on multiple.
merge_by = ['id', 'name1', 'name2']
a = df1.merge(df2, how='outer', on=merge_by)
So how I would imagine this to work is
df1.merge(df2, how='outer', on='id' or 'name1' or 'name2')
df1= DataFrame([
[0, 'john', 'bon', 'ron'],
[1, 'alex', 'dale', 'bruce'],
[2, 'joey', 'bill', 'maci'],
[3, 'choi', 'nath', 'karl'],
[4, 'walt', '', 'xander'],
], columns=['id','name1','name2','name3'])
id name1 name2 name3
0 'john' 'bon' 'ron'
1 'alex' 'dale' 'bruce'
2 'joey' 'bill' 'maci'
3 'choi' 'nath' 'karl'
4 'walt' '' 'xander'
df2= DataFrame([
[0, 'emil', 'tia', 'bia'],
[4, '', 'sara', 'carmen'],
[5, 'aden', 'dale', 'leia'],
[6, 'joey', 'jax', 'jace'],
[7, 'choi', 'nath', 'andre'],
[8, '', '', 'piper'],
], columns=['id','name1','name2','name3'])
id name1 name2 name3
0 'emil' 'tia' 'bia'
4 '' 'sara' 'carmen'
5 'aden' 'dale' 'leia'
6 'joey' 'jax' 'jace'
7 'choi' 'nath' 'andre'
8 '' '' 'piper'
所需的输出是:
id name1 name2 name3_x name3_y
0 'john' 'bon' 'ron' 'bia'
1 'alex' 'dale' 'bruce' 'leia'
2 'joey' 'bill' 'maci' 'jace'
3 'choi' 'nath' 'karl' 'andre'
4 'walt' 'sara' 'xander' 'carmen'
8 '' '' '' 'piper'
希望这对您有所帮助。
英文:
I have two dataframes with similar data. I want to merge them to combine all the information into one dataframe. The problem is, I would like to prioritize data from one dataframe if there are conflicts between merge of multiple columns (df1 in the example). And also I want to do it on multiple columns if ANY of the chosen columns match.
I apologize if my explanation is not clear enough. If there is any other information I should provide please let me know.
The way I do it now. This works fine if I would choose only one column but I can't figure out how to do it on multiple.
merge_by = ['id', 'name1', 'name2']
a = df1.merge(df2, how='outer', on=merge_by)
So how I would imagine this to work is
df1.merge(df2, how='outer', on='id' or 'name1' or 'name2')
df1= DataFrame([
[0, 'john', 'bon', 'ron'],
[1, 'alex', 'dale', 'bruce'],
[2, 'joey', 'bill', 'maci'],
[3, 'choi', 'nath', 'karl'],
[4, 'walt', '', 'xander'],
], columns=['id','name1','name2','name3'])
id name1 name2 name3
0 'john' 'bon' 'ron'
1 'alex' 'dale' 'bruce'
2 'joey' 'bill' 'maci'
3 'choi' 'nath' 'karl'
4 'walt' '' 'xander'
df2= DataFrame([
[0, 'emil', 'tia', 'bia'],
[4, '', 'sara', 'carmen'],
[5, 'aden', 'dale', 'leia'],
[6, 'joey', 'jax', 'jace'],
[7, 'choi', 'nath', 'andre'],
[8, '', '', 'piper'],
], columns=['id','name1','name2','name3'])
id name1 name2 name3
0 'emil' 'tia' 'bia'
4 '' 'sara' 'carmen'
5 'aden' 'dale' 'leia'
6 'joey' 'jax' 'jace'
7 'choi' 'nath' 'andre'
8 '' '' 'piper'
The output I would want
id name1 name2 name3_x name3_y
0 'john' 'bon' 'ron' 'bia'
1 'alex' 'dale' 'bruce' 'leia'
2 'joey' 'bill' 'maci' 'jace'
3 'choi' 'nath' 'karl' 'andre'
4 'walt' 'sara' 'xander' 'carmen'
8 '' '' '' 'piper'
Edit** Code taken from the answer here as suggested in the comments below.
df1= pd.DataFrame([
[0, 'john', 'bon', 'ron'],
[1, 'alex', 'dale', 'bruce'],
[2, 'joey', 'bill', 'maci'],
[3, 'choi', 'nath', 'karl'],
[4, 'walt', '', 'xander'],
], columns=['id','name1','name2','name3'])
df2= pd.DataFrame([
[0, 'emil', 'tia', 'bia'],
[4, '', 'sara', 'carmen'],
[5, 'aden', 'dale', 'leia'],
[6, 'joey', 'jax', 'jace'],
[7, 'choi', 'nath', 'andre'],
[8, '', '', 'piper'],
], columns=['id','name1','name2','name3'])
suff_A = ['_on_A_match_1', '_on_A_match_2']
suff_B = ['_on_B_match_1', '_on_B_match_2']
suff_C = ['_on_C_match_1', '_on_C_match_2']
df = pd.concat([df1.merge(df2[df2['id'] != ''], on='id', suffixes=suff_A),
df1.merge(df2[df2['name1'] != ''], on='name1', suffixes=suff_B),
df1.merge(df2[df2['name2'] != ''], on='name2', suffixes=suff_C)])
dups = (df.id_on_B_match_1 == df.id_on_B_match_2) # also could remove A_on_B_match
a = df.loc[~dups]
print(df)
The problem with this one is that the id 3 is repeated, I am not sure how to set up dups with more than 2 columns. And also how could I format the final output to be only the answers that I want?
答案1
得分: 1
以下是您要翻译的代码部分:
columns = "id", "name1", "name2"
df = pd.concat(
df1.merge(df2.dropna(subset=column), on=column, suffixes=["", "_y"])
for column in columns
).drop_duplicates("id")
ids = set(df["id"].dropna()).union(df["id_y"].dropna())
pd.concat([
df,
df1[~df1["id"].isin(ids)],
df2[~df2["id"].isin(ids)]
])
id name1 name2 name3 name1_y name2_y name3_y id_y
0 0 john bon ron emil tia bia NaN
1 4 walt NaN xander NaN sara carmen NaN
0 2 joey bill maci NaN jax jace 6.0
1 3 choi nath karl NaN nath andre 7.0
0 1 alex dale bruce aden NaN leia 5.0
5 8 NaN NaN piper NaN NaN NaN NaN
希望这对您有所帮助。如果您需要任何进一步的帮助,请随时告诉我。
英文:
columns = "id", "name1", "name2"
df = pd.concat(
df1.merge(df2.dropna(subset=column), on=column, suffixes=["", "_y"])
for column in columns
).drop_duplicates("id")
ids = set(df["id"].dropna()).union(df["id_y"].dropna())
pd.concat([
df,
df1[~df1["id"].isin(ids)],
df2[~df2["id"].isin(ids)]
])
id name1 name2 name3 name1_y name2_y name3_y id_y
0 0 john bon ron emil tia bia NaN
1 4 walt NaN xander NaN sara carmen NaN
0 2 joey bill maci NaN jax jace 6.0
1 3 choi nath karl NaN nath andre 7.0
0 1 alex dale bruce aden NaN leia 5.0
5 8 NaN NaN piper NaN NaN NaN NaN
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论