May 11, 2023, 11:53:29

How to merge dataframes if ANY of the columns match in pandas?

Question

I have two dataframes with similar data, and I want to merge them so that all the information ends up in one dataframe. When the merged columns conflict, I would like to prioritize the data from one dataframe (df1 in the example). I also want the merge to use multiple columns and treat rows as matching if ANY of the chosen columns match.

I apologize if my explanation is not clear enough. If there is any other information I should provide please let me know.

This is how I do it now. It works fine if I choose only one column, but I can't figure out how to do it on multiple columns.

merge_by = ['id', 'name1', 'name2']
a = df1.merge(df2, how='outer', on=merge_by)

This is how I would imagine it working:

df1.merge(df2, how='outer', on='id' or 'name1' or 'name2')
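
A short note on why the imagined call cannot express "any of these columns": Python evaluates the expression `'id' or 'name1' or 'name2'` before pandas sees it, and `or` returns its first truthy operand. The variable name `on_value` below is only illustrative.

```python
# `or` between non-empty strings returns the first one,
# so pandas would receive just a single column name.
on_value = 'id' or 'name1' or 'name2'
print(on_value)
```

In other words, that call would behave like a plain merge on `'id'`, so an "any column" match has to be built from several merges or from custom logic.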

import pandas as pd

df1 = pd.DataFrame([
    [0, 'john', 'bon', 'ron'],
    [1, 'alex', 'dale', 'bruce'],
    [2, 'joey', 'bill', 'maci'],
    [3, 'choi', 'nath', 'karl'],
    [4, 'walt', '', 'xander'],
], columns=['id','name1','name2','name3'])
id   name1   name2   name3
0    'john'   'bon'   'ron'
1    'alex'   'dale'  'bruce'
2    'joey'   'bill'  'maci'
3    'choi'   'nath'  'karl'
4    'walt'   ''      'xander'

df2 = pd.DataFrame([
    [0, 'emil', 'tia', 'bia'],
    [4, '', 'sara', 'carmen'],
    [5, 'aden', 'dale', 'leia'],
    [6, 'joey', 'jax', 'jace'],
    [7, 'choi', 'nath', 'andre'],
    [8, '', '', 'piper'],
], columns=['id','name1','name2','name3'])
id   name1   name2   name3
0    'emil'   'tia'   'bia'
4    ''       'sara'  'carmen'
5    'aden'   'dale'  'leia'
6    'joey'   'jax'   'jace'
7    'choi'   'nath'  'andre'
8    ''       ''      'piper'

The output I would want:

id   name1   name2   name3_x name3_y
0    'john'   'bon'   'ron'   'bia'
1    'alex'   'dale'  'bruce' 'leia'
2    'joey'   'bill'  'maci'  'jace'
3    'choi'   'nath'  'karl'  'andre'
4    'walt'   'sara'  'xander' 'carmen'
8    ''       ''      ''      'piper'

Edit: Code taken from the answer suggested in the comments below.

import pandas as pd

df1 = pd.DataFrame([
    [0, 'john', 'bon', 'ron'],
    [1, 'alex', 'dale', 'bruce'],
    [2, 'joey', 'bill', 'maci'],
    [3, 'choi', 'nath', 'karl'],
    [4, 'walt', '', 'xander'],
], columns=['id', 'name1', 'name2', 'name3'])
df2 = pd.DataFrame([
    [0, 'emil', 'tia', 'bia'],
    [4, '', 'sara', 'carmen'],
    [5, 'aden', 'dale', 'leia'],
    [6, 'joey', 'jax', 'jace'],
    [7, 'choi', 'nath', 'andre'],
    [8, '', '', 'piper'],
], columns=['id', 'name1', 'name2', 'name3'])
suff_A = ['_on_A_match_1', '_on_A_match_2']
suff_B = ['_on_B_match_1', '_on_B_match_2']
suff_C = ['_on_C_match_1', '_on_C_match_2']
# One inner merge per key column, skipping df2 rows where that key is empty
df = pd.concat([df1.merge(df2[df2['id'] != ''], on='id', suffixes=suff_A),
                df1.merge(df2[df2['name1'] != ''], on='name1', suffixes=suff_B),
                df1.merge(df2[df2['name2'] != ''], on='name2', suffixes=suff_C)])
dups = (df.id_on_B_match_1 == df.id_on_B_match_2) # also could remove A_on_B_match
a = df.loc[~dups]
print(df)

The problem with this one is that id 3 is repeated, and I am not sure how to set up dups with more than 2 columns. Also, how could I format the final output so that it contains only the results that I want?

Answer 1

Score: 1

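# Assumes empty strings were already replaced with NaN, as in the output below.
columns = "id", "name1", "name2"
# One inner merge per key column; df2 rows missing that key are skipped.
df = pd.concat(
    df1.merge(df2.dropna(subset=column), on=column, suffixes=["", "_y"])
    for column in columns
).drop_duplicates("id")  # keep the first match found for each df1 id
# ids from either frame that already appear in a matched row
ids = set(df["id"].dropna()).union(df["id_y"].dropna())
# Append the rows of df1 and df2 that matched nothing.
pd.concat([
    df,
    df1[~df1["id"].isin(ids)],
    df2[~df2["id"].isin(ids)]
])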

   id name1 name2   name3 name1_y name2_y name3_y  id_y
0   0  john   bon     ron    emil     tia     bia   NaN
1   4  walt   NaN  xander     NaN    sara  carmen   NaN
0   2  joey  bill    maci     NaN     jax    jace   6.0
1   3  choi  nath    karl     NaN    nath   andre   7.0
0   1  alex  dale   bruce    aden     NaN    leia   5.0
5   8   NaN   NaN   piper     NaN     NaN     NaN   NaN
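
The result above still carries the `_y` helper columns rather than the layout asked for in the question. As a rough sketch of one way to get there, assuming empty strings are first converted to NaN and that `id` is unique within each frame, the matched rows can be coalesced with df1 taking priority. The names `keys`, `matches`, `matched_df2_ids`, `left_only`, `right_only` and `result` are illustrative and not part of the answer.

```python
import numpy as np
import pandas as pd

cols = ["id", "name1", "name2", "name3"]
# Treat empty strings as missing so dropna/fillna can be used (assumption).
df1 = pd.DataFrame([
    [0, "john", "bon", "ron"],
    [1, "alex", "dale", "bruce"],
    [2, "joey", "bill", "maci"],
    [3, "choi", "nath", "karl"],
    [4, "walt", "", "xander"],
], columns=cols).replace("", np.nan)
df2 = pd.DataFrame([
    [0, "emil", "tia", "bia"],
    [4, "", "sara", "carmen"],
    [5, "aden", "dale", "leia"],
    [6, "joey", "jax", "jace"],
    [7, "choi", "nath", "andre"],
    [8, "", "", "piper"],
], columns=cols).replace("", np.nan)

keys = ["id", "name1", "name2"]

# One inner merge per key; df2 rows with a missing key cannot match on it.
matches = pd.concat(
    [df1.merge(df2.dropna(subset=[key]), on=key, suffixes=("", "_y")) for key in keys],
    ignore_index=True,
).drop_duplicates("id")  # keep the first match found for each df1 row

# df2 ids that took part in a match: `id` for the id merge, `id_y` otherwise.
matched_df2_ids = set(matches["id_y"].fillna(matches["id"]).astype(int))

# df1 values win; df2 only fills the gaps in name1 and name2.
result = pd.DataFrame({
    "id": matches["id"],
    "name1": matches["name1"].fillna(matches["name1_y"]),
    "name2": matches["name2"].fillna(matches["name2_y"]),
    "name3_x": matches["name3"],
    "name3_y": matches["name3_y"],
})

# Rows that matched nothing keep their name3 on their own side.
left_only = df1[~df1["id"].isin(matches["id"])].rename(columns={"name3": "name3_x"})
right_only = df2[~df2["id"].isin(matched_df2_ids)].rename(columns={"name3": "name3_y"})

result = pd.concat([result, left_only, right_only], ignore_index=True)
result = result.sort_values("id", ignore_index=True)
print(result)
```

Missing values come out as NaN rather than empty strings; a final `result.fillna("")` would restore the empty-string style if that is preferred. Keeping only the first match per df1 row is a design choice inherited from `drop_duplicates("id")`, so a df1 row that matches different df2 rows on different columns reports only the earliest key in `keys`.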
