Posted: 2023-07-12 20:48:25

Merging pandas dataframes with 2 keys that are not the same

Question

I have these two tables:

tab1         tab2
col1 col2    col1 col2  col3
A    2017    A    2017  foo
A    2018    A    2019  fii
A    2019    A    2020  fee
B    2017    B    2017  boo
B    2019    B    2020  bii
C    2017    C    2017  coo
C    2018    C    2018  cii

I want to merge these two tables in Python, using both col1 and col2 as keys. My problem is that, for example, the second row of tab1 is (A, 2018), but tab2 has no (A, 2018), only neighbors such as (A, 2017) and (A, 2019), so that row will be NaN in the merged table.
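To make the problem reproducible, here is a minimal sketch that builds the two tables and runs an exact-match left merge. The DataFrame construction and the variable names `exact` and `unmatched` are assumptions for illustration; only the table contents come from the question.

```python
import pandas as pd

# The two tables from the question.
tab1 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2018, 2019, 2017, 2019, 2017, 2018],
})
tab2 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2019, 2020, 2017, 2020, 2017, 2018],
    "col3": ["foo", "fii", "fee", "boo", "bii", "coo", "cii"],
})

# Exact-match left merge: (col1, col2) pairs absent from tab2 get NaN in col3.
exact = tab1.merge(tab2, on=["col1", "col2"], how="left")

# Rows of tab1 that found no exact partner in tab2.
unmatched = exact[exact["col3"].isna()]
print(unmatched)
```

With this data, the rows left without a match are (A, 2018) and (B, 2019).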

So my question is: how can I fill those rows using the nearest corresponding year from tab2? Instead of being a NaN row, it would be filled, for example, with the values from (A, 2019).

The result would look something like this:

merged_tab
col1 col2  col3
A    2017  foo
A    2018  fii
A    2019  fii
B    2017  boo
B    2019  boo
C    2017  coo
C    2018  cii

Thank you!

Answer 1

Score: 1

This looks like a merge_asof.

Here on the nearest previous value:

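import pandas as pd

# merge_asof requires both frames to be sorted by the `on` key;
# reset_index() keeps tab1's original row order so it can be restored afterwards.
out = (pd.merge_asof(tab1.reset_index().sort_values(by='col2'),
                     tab2.sort_values(by='col2'),
                     on='col2', by='col1', direction='backward')
         .set_index('index').reindex(tab1.index)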
      )

Output:

  col1  col2 col3
0    A  2017  foo
1    A  2018  foo  # 2018 is absent, let's take 2017
2    A  2019  fii
3    B  2017  boo
4    B  2019  boo
5    C  2017  coo
6    C  2018  cii

On the nearest following value:

out = (pd.merge_asof(tab1.reset_index().sort_values(by='col2'),
                     tab2.sort_values(by='col2'),
                     on='col2', by='col1', direction='forward')
         .set_index('index').reindex(tab1.index)
      )

Output:

  col1  col2 col3
0    A  2017  foo
1    A  2018  fii  # 2018 is absent, let's take 2019
2    A  2019  fii
3    B  2017  boo
4    B  2019  bii
5    C  2017  coo
6    C  2018  cii

Here on the nearest overall value:

out = (pd.merge_asof(tab1.reset_index().sort_values(by='col2'),
                     tab2.sort_values(by='col2'),
                     on='col2', by='col1', direction='nearest')
         .set_index('index').reindex(tab1.index)
      )

Output:

  col1  col2 col3
0    A  2017  foo
1    A  2018  foo  # 2017/2019 are equidistant, let's take 2017
2    A  2019  fii
3    B  2017  boo
4    B  2019  bii  # 2020 is closer than 2017
5    C  2017  coo
6    C  2018  cii
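
The three variants above differ only in the `direction` argument. As a minimal, self-contained sketch, they can be wrapped in one function and compared side by side. The helper name `merge_by_nearest_year` and the `results` dictionary are hypothetical, introduced here only for illustration; the `pd.merge_asof` call itself is the one used above.

```python
import pandas as pd

tab1 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2018, 2019, 2017, 2019, 2017, 2018],
})
tab2 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2019, 2020, 2017, 2020, 2017, 2018],
    "col3": ["foo", "fii", "fee", "boo", "bii", "coo", "cii"],
})


def merge_by_nearest_year(left, right, direction):
    """Hypothetical helper: as-of merge on col2 within each col1 group."""
    merged = pd.merge_asof(
        left.reset_index().sort_values(by="col2"),  # keep original row labels
        right.sort_values(by="col2"),               # merge_asof needs sorted keys
        on="col2", by="col1", direction=direction,
    )
    # Restore the row order of the left table.
    return merged.set_index("index").reindex(left.index)


# col3 obtained for each direction, in tab1's row order.
results = {
    direction: merge_by_nearest_year(tab1, tab2, direction)["col3"].tolist()
    for direction in ("backward", "forward", "nearest")
}
for direction, col3 in results.items():
    print(direction, col3)
```

Only rows 1 (A, 2018) and 4 (B, 2019) change between directions, since every other row has an exact match in tab2.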

And if you want to merge on the nearest value, but prefer the following one in case of a tie, negate col2 on both sides before merging and negate it back afterwards:

out = (pd.merge_asof(tab1.reset_index().eval('col2=-col2').sort_values(by='col2'),
                     tab2.eval('col2=-col2').sort_values(by='col2'),
                     on='col2', by='col1', direction='nearest')
         .set_index('index').reindex(tab1.index)
         .eval('col2=-col2')
      )

Output:

  col1  col2 col3
0    A  2017  foo
1    A  2018  fii  # 2017 and 2019 are equidistant, give 2019 priority
2    A  2019  fii
3    B  2017  boo
4    B  2019  bii  # 2020 is closer than 2017
5    C  2017  coo
6    C  2018  cii
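
Why the negation helps: as the earlier 'nearest' output shows, a tie is resolved in favor of the smaller key (2017). Negating col2 reverses the ordering, so the larger year becomes the smaller key and wins the tie. The sketch below checks this on the question's data; it uses `assign` instead of `eval` for the negation, which is an assumed equivalent and not the answer's exact code, and the names `neg_left` and `neg_right` are illustrative.

```python
import pandas as pd

tab1 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2018, 2019, 2017, 2019, 2017, 2018],
})
tab2 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2019, 2020, 2017, 2020, 2017, 2018],
    "col3": ["foo", "fii", "fee", "boo", "bii", "coo", "cii"],
})

# Negate the year on both sides, then sort by the negated key.
neg_left = (tab1.reset_index()
                .assign(col2=lambda d: -d["col2"])
                .sort_values(by="col2"))
neg_right = (tab2.assign(col2=lambda d: -d["col2"])
                 .sort_values(by="col2"))

out = (pd.merge_asof(neg_left, neg_right,
                     on="col2", by="col1", direction="nearest")
         .set_index("index").reindex(tab1.index)  # restore tab1's row order
         .assign(col2=lambda d: -d["col2"]))      # undo the negation

print(out)
```

Row 1 (A, 2018) now takes fii from 2019 instead of foo from 2017, while row 4 (B, 2019) still takes bii because 2020 is strictly closer than 2017.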
