Merging pandas dataframes with 2 keys that are not the same
Question
I have these two tables:
tab1              tab2
col1  col2        col1  col2  col3
A     2017        A     2017  foo
A     2018        A     2019  fii
A     2019        A     2020  fee
B     2017        B     2017  boo
B     2019        B     2020  bii
C     2017        C     2017  coo
C     2018        C     2018  cii
I want to merge these two tables in Python using both col1 and col2 as keys. My problem is that the keys do not always match: for example, the second row of tab1 is (A, 2018), but tab2 only has (A, 2017) and (A, 2019), so that row will be NaN in the merged table.
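To make the problem reproducible, here is a minimal sketch. The DataFrame construction is an assumption based on the tables above (integer years, default integer index); the names exact and missing are only illustrative. A plain left merge on both keys leaves col3 empty wherever tab2 has no exact match.

```python
import pandas as pd

# Assumed reconstruction of the two tables shown above
tab1 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2018, 2019, 2017, 2019, 2017, 2018],
})
tab2 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2019, 2020, 2017, 2020, 2017, 2018],
    "col3": ["foo", "fii", "fee", "boo", "bii", "coo", "cii"],
})

# Exact-match merge on both keys: unmatched (col1, col2) pairs get NaN in col3
exact = tab1.merge(tab2, on=["col1", "col2"], how="left")

# Rows of tab1 that found no partner in tab2
missing = exact[exact["col3"].isna()]
print(missing)
```

With this data, the unmatched rows are (A, 2018) and (B, 2019).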
So my question is: how can I fill those rows using the nearest corresponding year from tab2? Instead of being a NaN row, it would be filled from, for example, (A, 2019).
So the result would be something like this:
merged_tab
col1  col2  col3
A     2017  foo
A     2018  fii
A     2019  fii
B     2017  boo
B     2019  boo
C     2017  coo
C     2018  cii
Thank you!
Answer 1
Score: 1
This looks like a job for merge_asof.
Here on the nearest previous value:
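# merge_asof needs both frames sorted by the 'on' key;
# reset_index/set_index/reindex restore tab1's original row order afterwards
out = (pd.merge_asof(tab1.reset_index().sort_values(by='col2'),
                     tab2.sort_values(by='col2'),
                     on='col2', by='col1', direction='backward')
         .set_index('index').reindex(tab1.index)
      )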
Output:
  col1  col2 col3
0    A  2017  foo
1    A  2018  foo  # 2018 is absent, let's take 2017
2    A  2019  fii
3    B  2017  boo
4    B  2019  boo
5    C  2017  coo
6    C  2018  cii
On the nearest following value:
out = (pd.merge_asof(tab1.reset_index().sort_values(by='col2'),
                     tab2.sort_values(by='col2'),
                     on='col2', by='col1', direction='forward')
         .set_index('index').reindex(tab1.index)
      )
Output:
  col1  col2 col3
0    A  2017  foo
1    A  2018  fii  # 2018 is absent, let's take 2019
2    A  2019  fii
3    B  2017  boo
4    B  2019  bii
5    C  2017  coo
6    C  2018  cii
Here on the nearest overall value:
out = (pd.merge_asof(tab1.reset_index().sort_values(by='col2'),
                     tab2.sort_values(by='col2'),
                     on='col2', by='col1', direction='nearest')
         .set_index('index').reindex(tab1.index)
      )
Output:
  col1  col2 col3
0    A  2017  foo
1    A  2018  foo  # 2017/2019 are equidistant, let's take 2017
2    A  2019  fii
3    B  2017  boo
4    B  2019  bii  # 2020 is closer than 2017
5    C  2017  coo
6    C  2018  cii
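Since the three variants above differ only in the direction argument, they can be folded into one small helper. This is a sketch, not part of the original answer: the name asof_fill is hypothetical, and tab1/tab2 are an assumed reconstruction of the tables in the question.

```python
import pandas as pd

# Assumed reconstruction of the tables from the question
tab1 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2018, 2019, 2017, 2019, 2017, 2018],
})
tab2 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2019, 2020, 2017, 2020, 2017, 2018],
    "col3": ["foo", "fii", "fee", "boo", "bii", "coo", "cii"],
})


def asof_fill(left, right, direction):
    """Hypothetical helper: match on col2 within each col1 group,
    then restore the original row order of `left`."""
    merged = pd.merge_asof(
        left.reset_index().sort_values(by="col2"),  # both sides sorted by the 'on' key
        right.sort_values(by="col2"),
        on="col2", by="col1", direction=direction,
    )
    return merged.set_index("index").reindex(left.index)


# col3 obtained for each direction, in tab1's row order
results = {d: asof_fill(tab1, tab2, d)["col3"].tolist()
           for d in ("backward", "forward", "nearest")}
for direction, col3 in results.items():
    print(direction, col3)
```

The printed lists correspond to the col3 columns of the three outputs shown above.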
And if you want to merge on the nearest value, preferring the following one in case of a tie:
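# negate col2 on both sides so that ties in 'nearest' resolve to the later year,
# then negate it back after the merge
out = (pd.merge_asof(tab1.reset_index().eval('col2=-col2').sort_values(by='col2'),
                     tab2.eval('col2=-col2').sort_values(by='col2'),
                     on='col2', by='col1', direction='nearest')
         .set_index('index').reindex(tab1.index)
         .eval('col2=-col2')
      )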
Output:
  col1  col2 col3
0    A  2017  foo
1    A  2018  fii  # 2017 and 2019 are equidistant, give 2019 priority
2    A  2019  fii
3    B  2017  boo
4    B  2019  bii  # 2020 is closer than 2017
5    C  2017  coo
6    C  2018  cii
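As a cross-check of the tie-breaking rule, the same result can be reproduced without merge_asof. The sketch below swaps in a different technique: it pairs every tab1 row with every tab2 row of the same col1, then keeps the pair with the smallest year distance, preferring the later year on ties. The names pairs, dist and check are illustrative, and tab1/tab2 are an assumed reconstruction of the question's tables. It builds all pairs per group, so it is only sensible for small data.

```python
import pandas as pd

# Assumed reconstruction of the tables from the question
tab1 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2018, 2019, 2017, 2019, 2017, 2018],
})
tab2 = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B", "C", "C"],
    "col2": [2017, 2019, 2020, 2017, 2020, 2017, 2018],
    "col3": ["foo", "fii", "fee", "boo", "bii", "coo", "cii"],
})

# Pair each tab1 row with every tab2 row that shares its col1;
# tab2's year ends up in the column 'col2_tab2'
pairs = tab1.reset_index().merge(tab2, on="col1", suffixes=("", "_tab2"))

# Absolute distance between the two years
pairs["dist"] = (pairs["col2_tab2"] - pairs["col2"]).abs()

# Smallest distance first; on equal distance the later tab2 year comes first
pairs = pairs.sort_values(["index", "dist", "col2_tab2"],
                          ascending=[True, True, False])

# Keep the best candidate for each original tab1 row
check = (pairs.drop_duplicates("index")
              .set_index("index")[["col1", "col2", "col3"]])
check.index.name = None
print(check)
```

Changing the last ascending flag to True would prefer the earlier year on ties instead.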