Merging pandas dataframes with 2 keys that are not the same

Question

I have these two tables:

    tab1            tab2
    col1  col2      col1  col2  col3
    A     2017      A     2017  foo
    A     2018      A     2019  fii
    A     2019      A     2020  fee
    B     2017      B     2017  boo
    B     2019      B     2020  bii
    C     2017      C     2017  coo
    C     2018      C     2018  cii

I want to merge these two tables in Python, using both col1 and col2 as keys. My problem is that some key pairs do not match: for example, the second row of tab1 is (A, 2018), but tab2 has no (A, 2018), only neighbouring years such as (A, 2017) and (A, 2019), so that row will be NaN in the merged table.
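To make the problem concrete, here is a minimal sketch of the exact merge described above. The DataFrame construction is reconstructed from the tables in the question, and the names `exact` and `unmatched` are illustrative only, not part of the original post.

```python
import pandas as pd

# Tables reconstructed from the question.
tab1 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2018, 2019, 2017, 2019, 2017, 2018],
})
tab2 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2019, 2020, 2017, 2020, 2017, 2018],
    "col3": ["foo", "fii", "fee", "boo", "bii", "coo", "cii"],
})

# An exact left merge on both keys keeps every tab1 row, but rows whose
# (col1, col2) pair is absent from tab2 get NaN in col3.
exact = tab1.merge(tab2, on=["col1", "col2"], how="left")

# Rows that found no exact partner in tab2.
unmatched = exact[exact["col3"].isna()]
print(unmatched)
```

With this data, the unmatched rows should be (A, 2018) and (B, 2019), which are the rows that need a nearest-year match.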

So my question is: how can I fill those rows with the nearest corresponding year from tab2? Instead of being a NaN row, it would be filled, for example, with the value from (A, 2019).

So the result would be something like this:

    merged_tab
    col1  col2  col3
    A     2017  foo
    A     2018  fii
    A     2019  fii
    B     2017  boo
    B     2019  boo
    C     2017  coo
    C     2018  cii

Thank you!

Answer 1

Score: 1

This looks like a job for pd.merge_asof.

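Here, matching on the nearest previous value:

    import pandas as pd

    # merge_asof needs both frames sorted on the `on` key; reset_index keeps
    # the original row labels so tab1's order can be restored afterwards.
    out = (pd.merge_asof(tab1.reset_index().sort_values(by='col2'),
                         tab2.sort_values(by='col2'),
                         on='col2', by='col1', direction='backward')
             .set_index('index').reindex(tab1.index)
          )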

Output:

      col1  col2 col3
    0    A  2017  foo
    1    A  2018  foo  # 2018 is absent, let's take 2017
    2    A  2019  fii
    3    B  2017  boo
    4    B  2019  boo
    5    C  2017  coo
    6    C  2018  cii

On the nearest following value:

    out = (pd.merge_asof(tab1.reset_index().sort_values(by='col2'),
                         tab2.sort_values(by='col2'),
                         on='col2', by='col1', direction='forward')
             .set_index('index').reindex(tab1.index)
          )

Output:

      col1  col2 col3
    0    A  2017  foo
    1    A  2018  fii  # 2018 is absent, let's take 2019
    2    A  2019  fii
    3    B  2017  boo
    4    B  2019  bii
    5    C  2017  coo
    6    C  2018  cii

Here on the nearest overall value:

    out = (pd.merge_asof(tab1.reset_index().sort_values(by='col2'),
                         tab2.sort_values(by='col2'),
                         on='col2', by='col1', direction='nearest')
             .set_index('index').reindex(tab1.index)
          )

Output:

      col1  col2 col3
    0    A  2017  foo  
    1    A  2018  foo  # 2017/2019 are equidistant, let's take 2017
    2    A  2019  fii
    3    B  2017  boo
    4    B  2019  bii  # 2020 is closer than 2017
    5    C  2017  coo
    6    C  2018  cii
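To compare the three directions side by side, the same call can be run in a loop. This is a sketch rather than part of the original answer: the DataFrames are reconstructed from the question, and the `comparison` frame and its column names are illustrative.

```python
import pandas as pd

# Tables reconstructed from the question.
tab1 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2018, 2019, 2017, 2019, 2017, 2018],
})
tab2 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2019, 2020, 2017, 2020, 2017, 2018],
    "col3": ["foo", "fii", "fee", "boo", "bii", "coo", "cii"],
})

# Both inputs must be sorted on the `on` key before calling merge_asof.
left = tab1.reset_index().sort_values(by="col2")
right = tab2.sort_values(by="col2")

# One result column per direction, aligned on tab1's original index.
comparison = tab1.copy()
for direction in ("backward", "forward", "nearest"):
    merged = (
        pd.merge_asof(left, right, on="col2", by="col1", direction=direction)
        .set_index("index")
        .reindex(tab1.index)
    )
    comparison[direction] = merged["col3"]

print(comparison)
```

Each of the three new columns should match the corresponding output listed in this answer; the rows where they differ are (A, 2018) and (B, 2019).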

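And if you want to merge on the nearest value, taking the following one when two candidates are equally distant, negate col2 on both sides before merging and negate it back afterwards. As the previous output shows, a tie goes to the smaller key, which is the later year once the keys are negated:

    out = (pd.merge_asof(tab1.reset_index().eval('col2=-col2').sort_values(by='col2'),
                         tab2.eval('col2=-col2').sort_values(by='col2'),
                         on='col2', by='col1', direction='nearest')
             .set_index('index').reindex(tab1.index)
             .eval('col2=-col2')  # restore the original years
          )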

Output:

      col1  col2 col3
    0    A  2017  foo
    1    A  2018  fii  # 2017 and 2019 are equally distant, give 2019 priority
    2    A  2019  fii
    3    B  2017  boo
    4    B  2019  bii  # 2020 is closer than 2017
    5    C  2017  coo
    6    C  2018  cii
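The negation trick can be wrapped in a small function. The sketch below is one possible packaging, not the answer's own code: `merge_nearest_prefer_later` is a hypothetical helper name, it uses `assign` instead of `eval` for the sign flip, and it assumes the left frame has an unnamed index so that `reset_index()` produces a column called `index`.

```python
import pandas as pd


def merge_nearest_prefer_later(left, right, on, by):
    """Nearest-key merge_asof where ties go to the larger key (hypothetical helper)."""

    def flip(df):
        # Negate the key column; applying flip twice restores the values.
        return df.assign(**{on: -df[on]})

    merged = pd.merge_asof(
        flip(left.reset_index()).sort_values(by=on),
        flip(right).sort_values(by=on),
        on=on, by=by, direction="nearest",
    )
    # Restore the original row order of `left`, then the original key sign.
    return flip(merged.set_index("index").reindex(left.index))


# Tables reconstructed from the question.
tab1 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2018, 2019, 2017, 2019, 2017, 2018],
})
tab2 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2019, 2020, 2017, 2020, 2017, 2018],
    "col3": ["foo", "fii", "fee", "boo", "bii", "coo", "cii"],
})

out = merge_nearest_prefer_later(tab1, tab2, on="col2", by="col1")
print(out)
```

Keeping the sign flip in one place makes it harder to forget the final negation that restores the years.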

huangapple
  • Posted on July 12, 2023 at 20:48:25
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76670744.html