March 3, 2023, 21:00:12

Merge certain columns of a pandas dataframe with data from another dataframe by condition

Question

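I understand what you need: you want to fill the empty 'Region' and 'Country' cells in DataFrame df1 with data from DataFrame df2 while keeping the information df1 already holds. The following code sketches one way to approach this: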

import pandas as pd
import numpy as np
# Sample data
data1 = {
    'Substance': ['Substance1', 'Substance2', 'Substance1', 'Substance3', 'Substance2', 'Substance1', 'Substance3', 'Substance2', 'Substance4', 'Product A'],
    'Name': ['Bayer', 'Sanofi', 'Pfizer', 'AstraZeneca', 'BionTech', 'Sanofi', 'Sanofi', 'Bayer', 'Pfizer', 'Braun'],
    'Region': ['Europe', 'Europe', 'Asia', 'Asia', 'North America', np.nan, np.nan, 'Europe', 'Asia', 'Asia'],
    'Country': ['France', np.nan, np.nan, 'P.R. China', 'United States', np.nan, np.nan, np.nan, 'India', 'Indonesia'],
    'Misc': [910, 200, 898, 910, 910, 12, 34, 700, 800, 10000],
}
data2 = {
    'Substance': ['Substance1', 'Substance1', 'Substance1', 'Substance3', 'Substance2', 'Substance3'],
    'Name': ['Bayer', 'Sanofi', 'Pfizer', 'AstraZeneca', 'BionTech', 'Sanofi'],
    'Region': ['Europe', 'Europe', 'Asia', 'Asia', 'North America', np.nan],
    'Country': ['France', 'Germany', 'India', 'P.R. China', 'United States', 'Poland'],
    'Misc': [910, 200, 898, 910, 910, 40],
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
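# Look up each (Substance, Name) pair from df1 in df2
keys = ['Substance', 'Name']
lookup = df1[keys].merge(df2.drop_duplicates(subset=keys), on=keys, how='left')
# Fill only the gaps; non-null values in df1 are kept
df3 = df1.copy()  # copy df1 so the original data stays untouched
df3[['Region', 'Country']] = df1[['Region', 'Country']].fillna(lookup[['Region', 'Country']])
# Print the result
print(df3)

This code keeps the non-null values in df1 and fills the empty 'Region' and 'Country' cells with data from df2, matching rows on 'Substance' and 'Name'. It assumes df1 has the default integer index. A cell in df1 that already has a value is not replaced by the value from df2.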

English original:

I have the following challenge: I have two pandas DataFrames with information about, e.g., chemical substances and some related additional information, such as the production region or country.
For example:

import numpy as np
import pandas as pd

data1 = {
    'Substance': ['Substance1', 'Substance2', 'Substance1', 'Substance3', 'Substance2', 'Substance1', 'Substance3', 'Substance2', 'Substance4'],
    'Name': ['Bayer', 'Sanofi', 'Pfizer', 'AstraZeneca', 'BionTech', 'Sanofi', 'Sanofi', 'Bayer', 'Pfizer'],
    'Region': ['Europe', 'Europe', 'Asia', 'Asia', 'North America', np.nan, np.nan, 'Europe', 'Asia'],
    'Country': ['France', np.nan, np.nan, 'P.R. China', 'United States', np.nan, np.nan, np.nan, 'India'],
    'Misc': [910, 200, 898, 910, 910, 12, 34, 700, 800],
}
df1 = pd.DataFrame(data1)
#print(df1)

and

data2 = {
    'Substance': ['Substance1', 'Substance2', 'Substance1', 'Substance3', 'Substance2'],
    'Name': ['Bayer', 'Sanofi', 'Pfizer', 'AstraZeneca', 'BionTech'],
    'Region': ['Europe', 'Europe', 'Asia', 'Asia', 'North America'],
    'Country': ['France', 'Germany', 'India', 'P.R. China', 'United States'],
    'Misc': [910, 200, 898, 910, 910],
}
df2 = pd.DataFrame(data2)
#print(df2)

The first DataFrame lists all orders from one year. A substance may be supplied by several manufacturers, and a row may appear more than once (once per order).
I would like to supplement df1 with information from certain columns of df2. The goal: if the value in the 'Region' or 'Country' column of df1 is empty for a given substance and manufacturer, insert the value from the corresponding column of df2. This is what I tried:

df3 = df1.loc[df1.Substance.isin(df2.Substance), ['Substance', 'Region', 'Country']] = df2[['Substance', 'Region', 'Country']]
#print(df3)

However, I then lose many rows (e.g. when substances are repeated). I would like to keep all rows of df1 and only supplement them with information from df2.
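
A note on why rows go missing in the attempt above, as a minimal sketch on reduced, hypothetical frames (the shortened data and the name `rhs` are assumptions for illustration only). In Python, `a = b = value` binds the same `value` to every target, so `df3` ends up being the df2 slice itself rather than an updated df1:

```python
import numpy as np
import pandas as pd

# Reduced, hypothetical stand-ins for df1 and df2
df1 = pd.DataFrame({
    'Substance': ['Substance1', 'Substance2', 'Substance1', 'Substance4'],
    'Region': ['Europe', np.nan, np.nan, 'Asia'],
    'Country': ['France', np.nan, np.nan, 'India'],
})
df2 = pd.DataFrame({
    'Substance': ['Substance1', 'Substance2'],
    'Region': ['Europe', 'Europe'],
    'Country': ['France', 'Germany'],
})

cols = ['Substance', 'Region', 'Country']
rhs = df2[cols]  # the right-hand side, named only for this illustration

# Chained assignment: `rhs` is bound to df3 AND written into the df1 slice.
# The df1 write is aligned on index labels, not on matching Substance values.
df3 = df1.loc[df1.Substance.isin(df2.Substance), cols] = rhs

print(df3 is rhs)          # df3 is simply the df2 slice
print(len(df1), len(df3))  # df3 has as many rows as df2, not as df1
```

So the row count of `df3` follows df2, which would explain the lost rows; a key-based merge avoids both the chained assignment and the index alignment.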

I would expect the following result for df3:


|   | Substance  | Name        | Region        | Country       | Misc |
-----------------------------------------------------------------------
| 0 | Substance1 | Bayer       | Europe        | France        | 910  |
| 1 | Substance2 | Sanofi      | Europe        | Germany       | 200  |
| 2 | Substance1 | Pfizer      | Asia          | India         | 898  |
| 3 | Substance3 | AstraZeneca | Asia          | P.R. China    | 910  |
| 4 | Substance2 | BionTech    | North America | United States | 910  |
| 5 | Substance1 | Sanofi      | Europe        | France        | 12   |
| 6 | Substance3 | Sanofi      | np.nan        | np.nan        | 34   |
| 7 | Substance2 | Bayer       | Europe        | Germany       | 700  |
| 8 | Substance4 | Pfizer      | Asia          | India         | 800  |

Any suggestions would be appreciated. Many thanks in advance!

EDIT:

I'm not sure if this is a new question or an edit of my old question - that's why I'm posting this as a comment and want to ask for your help again, especially @mozway.

data1 = {
    'Substance': ['Substance1', 'Substance2', 'Substance1', 'Substance3', 'Substance2', 'Substance1', 'Substance3', 'Substance2', 'Substance4', 'Product A'],
    'Name': ['Bayer', 'Sanofi', 'Pfizer', 'AstraZeneca', 'BionTech', 'Sanofi', 'Sanofi', 'Bayer', 'Pfizer', 'Braun'],
    'Region': ['Europe', 'Europe', 'Asia', 'Asia', 'North America', np.nan, np.nan, 'Europe', 'Asia', 'Asia'],
    'Country': ['France', np.nan, np.nan, 'P.R. China', 'United States', np.nan, np.nan, np.nan, 'India', 'Indonesia'],
    'Misc': [910, 200, 898, 910, 910, 12, 34, 700, 800, 10000],
}

df1 = pd.DataFrame(data1)
#print('line ', lineno(), '\n ', df1)

data2 = {
    'Substance': ['Substance1', 'Substance1', 'Substance1', 'Substance3', 'Substance2', 'Substance3'],
    'Name': ['Bayer', 'Sanofi', 'Pfizer', 'AstraZeneca', 'BionTech', 'Sanofi'],
    'Region': ['Europe', 'Europe', 'Asia', 'Asia', 'North America', np.nan],
    'Country': ['France', 'Germany', 'India', 'P.R. China', 'United States', 'Poland'],
    'Misc': [910, 200, 898, 910, 910, 40],
}
df2 = pd.DataFrame(data2)
#print('line ', lineno(), '\n ', df2)

Using 'merge' as suggested by @mozway:

out = (df1[['Substance',
            'Name',
            # 'Region',
            # 'Country',
            ]]
       .merge(df2  # .drop(columns='Substance')
                .groupby(['Substance', 'Name'], as_index=False).first(),
              how='left')
      )
#print('line ', lineno(), '\n ', out)

It results in

      Substance         Name         Region        Country   Misc
0  Substance1        Bayer         Europe         France  910.0
1  Substance2       Sanofi            NaN            NaN    NaN
2  Substance1       Pfizer           Asia          India  898.0
3  Substance3  AstraZeneca           Asia     P.R. China  910.0
4  Substance2     BionTech  North America  United States  910.0
5  Substance1       Sanofi         Europe        Germany  200.0
6  Substance3       Sanofi            NaN            NaN    NaN
7  Substance2        Bayer            NaN            NaN    NaN
8  Substance4       Pfizer            NaN            NaN    NaN

But I want the 'Region' and 'Country' columns of df1 to be completed with the values from the 'Region' and 'Country' columns of df2. For example, in line 5 of the out DataFrame, the values for Substance1 distributed by Sanofi are supplemented from line 1 of df2. That's great. But the information that df1 already holds in these columns is not considered. See, for instance, the line for Product A, where the values of 'Region' and 'Country' are NaN. Or line 1 of the out DataFrame, where Substance2 by Sanofi has NaN in the 'Region' and 'Country' columns even though the 'Region' is given in df1 (line 1).

I am looking for an approach that keeps the information from df1 where it exists (i.e. is not NaN) and otherwise adds the information from df2. I really appreciate your support, many thanks in advance!

Answer 1

Score: 3

You can use a custom merge:

out = (df1[['Substance', 'Name']]
       .merge(df2.drop(columns='Substance')
                .groupby('Name', as_index=False).first(),
              how='left')
      )

Output:

    Substance         Name         Region        Country  Misc
0  Substance1        Bayer         Europe         France   910
1  Substance2       Sanofi         Europe        Germany   200
2  Substance1       Pfizer           Asia          India   898
3  Substance3  AstraZeneca           Asia     P.R. China   910
4  Substance2     BionTech  North America  United States   910
5  Substance1       Sanofi         Europe        Germany   200
6  Substance3       Sanofi         Europe        Germany   200
7  Substance2        Bayer         Europe         France   910
8  Substance4       Pfizer           Asia          India   898
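
For the follow-up in the question's EDIT (keep what df1 already has, fill only the gaps from df2), one possible variant is sketched below. It is not part of the original answer. It assumes that rows should be matched on both 'Substance' and 'Name', and it uses a reduced copy of the edited data; the names `keys`, `fill_cols`, `lookup`, `merged` and the `_df2` suffix are illustrative choices.

```python
import numpy as np
import pandas as pd

# Reduced versions of df1 and df2 from the question's EDIT
df1 = pd.DataFrame({
    'Substance': ['Substance1', 'Substance2', 'Substance1', 'Substance3', 'Product A'],
    'Name': ['Bayer', 'Sanofi', 'Sanofi', 'Sanofi', 'Braun'],
    'Region': ['Europe', 'Europe', np.nan, np.nan, 'Asia'],
    'Country': ['France', np.nan, np.nan, np.nan, 'Indonesia'],
    'Misc': [910, 200, 12, 34, 10000],
})
df2 = pd.DataFrame({
    'Substance': ['Substance1', 'Substance1', 'Substance3'],
    'Name': ['Bayer', 'Sanofi', 'Sanofi'],
    'Region': ['Europe', 'Europe', np.nan],
    'Country': ['France', 'Germany', 'Poland'],
    'Misc': [910, 200, 40],
})

keys = ['Substance', 'Name']       # assumed matching key
fill_cols = ['Region', 'Country']  # columns to complete

# One lookup row per key pair; first() takes the first non-null value per column
lookup = df2.groupby(keys, as_index=False)[fill_cols].first()

# A left merge keeps every df1 row in its original order;
# the df2 values arrive in columns with a '_df2' suffix
merged = df1.merge(lookup, on=keys, how='left', suffixes=('', '_df2'))

out = df1.copy()
for col in fill_cols:
    # Keep df1's value where present, otherwise take the df2 value
    out[col] = merged[col].fillna(merged[col + '_df2']).to_numpy()

print(out)
```

Because the lookup holds at most one row per key pair, the merge cannot multiply rows of df1, and 'Misc' stays as recorded in df1. Cells with no match in df2 (for example the 'Country' of Substance2 by Sanofi in this reduced data) remain NaN.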
