Merging pandas dataframes with 2 keys that are not the same

Question

I have these two tables:

tab1         tab2
col1 col2    col1 col2  col3
A    2017    A    2017  foo
A    2018    A    2019  fii
A    2019    A    2020  fee
B    2017    B    2017  boo
B    2019    B    2020  bii
C    2017    C    2017  coo
C    2018    C    2018  cii

I want to merge these two tables in Python using both col1 and col2 as keys. My problem is that the keys do not always match: for example, the second row of tab1 is (A, 2018), but tab2 only has (A, 2017) and (A, 2019), so col3 for that row will be NaN in the merged table.

So my question is: how can I fill those rows with the value from the nearest corresponding year in tab2? Instead of being a NaN row, it would be filled from, for example, (A, 2019).

The result would look something like this:

merged_tab
col1 col2  col3
A    2017  foo
A    2018  fii
A    2019  fii
B    2017  boo
B    2019  boo
C    2017  coo
C    2018  cii

Thank you!

Answer 1

Score: 1

This looks like a job for pd.merge_asof, which joins on the nearest key instead of requiring an exact match.
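To make the snippets in this answer reproducible, the two tables from the question can be rebuilt as below. This is only a sketch of the setup: the integer dtype of col2 is an assumption (the question does not state it), and `exact` is an illustrative name used to show what a plain exact-key merge gives.

```python
import pandas as pd

# Rebuild the two tables from the question (col2 assumed to be integer years)
tab1 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2018, 2019, 2017, 2019, 2017, 2018],
})
tab2 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2019, 2020, 2017, 2020, 2017, 2018],
    "col3": ["foo", "fii", "fee", "boo", "bii", "coo", "cii"],
})

# A plain left merge on both keys leaves NaN wherever tab2 has no
# exactly matching (col1, col2) pair: here (A, 2018) and (B, 2019)
exact = tab1.merge(tab2, on=["col1", "col2"], how="left")
print(exact)
```

The two NaN rows in `exact` are the ones that merge_asof is meant to fill.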

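Here on the nearest previous value (direction='backward'):

import pandas as pd

# merge_asof needs both frames sorted on the `on` key; reset_index() keeps
# tab1's original row labels so the original order can be restored afterwards.
out = (pd.merge_asof(tab1.reset_index().sort_values(by='col2'),
                     tab2.sort_values(by='col2'),
                     on='col2', by='col1', direction='backward')
         .set_index('index').reindex(tab1.index)
      )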

Output:

  col1  col2 col3
0    A  2017  foo
1    A  2018  foo  # 2018 is absent, let's take 2017
2    A  2019  fii
3    B  2017  boo
4    B  2019  boo
5    C  2017  coo
6    C  2018  cii
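
The reset_index / set_index / reindex steps are there because merge_asof requires both inputs to be sorted on the `on` key, and its output follows that sorted order rather than tab1's original one. A small sketch of the difference, with the tables rebuilt so the block runs on its own (col2 is assumed to be integer; `sorted_only` and `restored` are illustrative names):

```python
import pandas as pd

tab1 = pd.DataFrame({"col1": list("AAABBCC"),
                     "col2": [2017, 2018, 2019, 2017, 2019, 2017, 2018]})
tab2 = pd.DataFrame({"col1": list("AAABBCC"),
                     "col2": [2017, 2019, 2020, 2017, 2020, 2017, 2018],
                     "col3": ["foo", "fii", "fee", "boo", "bii", "coo", "cii"]})

# Without the index round trip, rows come back ordered by col2
sorted_only = pd.merge_asof(tab1.sort_values(by="col2"),
                            tab2.sort_values(by="col2"),
                            on="col2", by="col1", direction="backward")
print(sorted_only)

# With it, each row returns to its original position in tab1
restored = (pd.merge_asof(tab1.reset_index().sort_values(by="col2"),
                          tab2.sort_values(by="col2"),
                          on="col2", by="col1", direction="backward")
              .set_index("index").reindex(tab1.index))
print(restored)
```

If the row order of the result does not matter, the shorter `sorted_only` form is enough.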

On the nearest following value:

out = (pd.merge_asof(tab1.reset_index().sort_values(by='col2'),
                     tab2.sort_values(by='col2'),
                     on='col2', by='col1', direction='forward')
         .set_index('index').reindex(tab1.index)
      )

Output:

  col1  col2 col3
0    A  2017  foo
1    A  2018  fii  # 2018 is absent, let's take 2019
2    A  2019  fii
3    B  2017  boo
4    B  2019  bii
5    C  2017  coo
6    C  2018  cii

Here on the nearest overall value:

out = (pd.merge_asof(tab1.reset_index().sort_values(by='col2'),
                     tab2.sort_values(by='col2'),
                     on='col2', by='col1', direction='nearest')
         .set_index('index').reindex(tab1.index)
      )

Output:

  col1  col2 col3
0    A  2017  foo
1    A  2018  foo  # 2017/2019 are equidistant, let's take 2017
2    A  2019  fii
3    B  2017  boo
4    B  2019  bii  # 2020 is closer than 2017
5    C  2017  coo
6    C  2018  cii
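
To compare the three directions at a glance, they can be put side by side in one frame. This is a sketch only: `asof_merge` and `comparison` are hypothetical names that do not appear in the answer, and col2 is assumed to be integer.

```python
import pandas as pd

tab1 = pd.DataFrame({"col1": list("AAABBCC"),
                     "col2": [2017, 2018, 2019, 2017, 2019, 2017, 2018]})
tab2 = pd.DataFrame({"col1": list("AAABBCC"),
                     "col2": [2017, 2019, 2020, 2017, 2020, 2017, 2018],
                     "col3": ["foo", "fii", "fee", "boo", "bii", "coo", "cii"]})


def asof_merge(left, right, direction):
    """Hypothetical helper: merge_asof on col2 within each col1 group,
    returning the rows in the original order of `left`."""
    merged = pd.merge_asof(left.reset_index().sort_values(by="col2"),
                           right.sort_values(by="col2"),
                           on="col2", by="col1", direction=direction)
    return merged.set_index("index").reindex(left.index)


# One col3 column per direction, aligned on tab1's index
comparison = tab1.copy()
for direction in ("backward", "forward", "nearest"):
    comparison[direction] = asof_merge(tab1, tab2, direction)["col3"]
print(comparison)
```

The rows where the columns differ are exactly the ones with no exact match in tab2: (A, 2018) and (B, 2019).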

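And if you want to merge on the nearest value, preferring the following one in case of a tie, negate col2 on both sides before merging. This reverses the sort order, so the tie that merge_asof would resolve toward the earlier year now resolves toward the later one; the final eval restores the original sign: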

out = (pd.merge_asof(tab1.reset_index().eval('col2=-col2').sort_values(by='col2'),
                     tab2.eval('col2=-col2').sort_values(by='col2'),
                     on='col2', by='col1', direction='nearest')
         .set_index('index').reindex(tab1.index)
         .eval('col2=-col2')
      )

Output:

  col1  col2 col3
0    A  2017  foo
1    A  2018  fii  # 2017 and 2019 are equidistant, give 2019 priority
2    A  2019  fii
3    B  2017  boo
4    B  2019  bii  # 2020 is closer than 2017
5    C  2017  coo
6    C  2018  cii
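
The question mentions wanting the row to be filled with (A, 2019), that is, knowing which year was actually matched. The merged result keeps only tab1's col2, so one possible approach is to copy tab2's year into an extra column before merging. In the sketch below, `matched_year` is a hypothetical column name, col2 is assumed to be integer, and direction='forward' is chosen only for illustration.

```python
import pandas as pd

tab1 = pd.DataFrame({"col1": list("AAABBCC"),
                     "col2": [2017, 2018, 2019, 2017, 2019, 2017, 2018]})
tab2 = pd.DataFrame({"col1": list("AAABBCC"),
                     "col2": [2017, 2019, 2020, 2017, 2020, 2017, 2018],
                     "col3": ["foo", "fii", "fee", "boo", "bii", "coo", "cii"]})

# Duplicate tab2's year so it survives the merge as its own column
right = tab2.assign(matched_year=tab2["col2"])

out = (pd.merge_asof(tab1.reset_index().sort_values(by="col2"),
                     right.sort_values(by="col2"),
                     on="col2", by="col1", direction="forward")
         .set_index("index").reindex(tab1.index))
print(out)
```

Here col2 is the year requested in tab1 and `matched_year` is the year of the tab2 row that supplied col3.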

huangapple
  • Posted on July 12, 2023 at 20:48:25
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76670744.html