英文:
How can I efficiently merge these dataframes on range values?
问题
我有两个数据框:
section_headers =
start_sect_ end_sect_
0 0 50
1 121 139
2 221 270
sentences =
start_sent_ end_sent_
0 0 50
1 56 76
2 77 85
3 88 111
4 114 120
5 121 139
6 221 270
我试图合并属于每个section_header
下的sentences
...
一句话属于一个section_header
当且仅当它的start_sent_
大于等于section_header
的start_sect_
并且小于等于下一个section_header
的start_sect_
,以此类推。
根据此条件,我的期望输出是:
merge =
start_sent_ end_sent_ start_sect_
0 0 50 0
1 56 76 0
2 77 85 0
3 88 111 0
4 114 120 0
5 121 139 121
6 221 270 221
我最初将其转换为字典,然后基于条件创建了一个新的数据框,但我处理的数据量非常大,遍历记录需要很长时间。
我正在尝试想出一种方法,以避免不必要地遍历这些记录来合并数据。我尝试了这里的广播方法 Solution 2: Numpy Solution for large dataset,但由于此方法不允许对数组进行索引,所以无法使用。否则,它对我有两个其他合并用例的情况非常有效。
英文:
I have two dataframes:
section_headers =
start_sect_ end_sect_
0 0 50
1 121 139
2 221 270
sentences =
start_sent_ end_sent_
0 0 50
1 56 76
2 77 85
3 88 111
4 114 120
5 121 139
6 221 270
I'm trying to merge sentences
that belongs under each section_header
...
A sentence belongs under a section_header when its start_sent_ is greater than or equal to that of a section_header's start_sect_ and less than or equal to the next section_header's start_sect_, etc.
Given this, my desired output is:
merge =
start_sent_ end_sent_ start_sect_
0 0 50 0
1 56 76 0
2 77 85 0
3 88 111 0
4 114 120 0
5 121 139 121
6 221 270 221
I initially converted this to a dictionary and then created a new dataframe based on the conditions, but the amount of data I'm dealing with was very large and it took forever to iterate through the records.
I'm trying to devise a way to not have to iterate through these records to do a merge of the data. I tried the broadcast method here Solution 2: Numpy Solution for large dataset, but since this method doesn't allow indexing of the arrays, it doesn't work. Otherwise, it works great for two other merge use cases I have.
答案1
得分: 1
这似乎可以使用 merge_asof
函数来实现。
使用 direction="backward"
,以 section_headers
作为右侧数据框,在 <=
行上进行合并:
pd.merge_asof(sentences, section_headers["start_sect_"],
left_on="start_sent_", right_on="start_sect_",
direction="backward")
#Out[]:
# start_sent_ end_sent_ start_sect_
#0 0 50 0
#1 56 76 0
#2 77 85 0
#3 88 111 0
#4 114 120 0
#5 121 139 121
#6 221 270 221
英文:
This looks like a use for merge_asof
.
Using direction="backward"
, with section_headers
as the right DF, the merge will be on the <=
to row:
pd.merge_asof(sentences, section_headers["start_sect_"],
left_on="start_sent_", right_on="start_sect_",
direction="backward")
#Out[]:
# start_sent_ end_sent_ start_sect_
#0 0 50 0
#1 56 76 0
#2 77 85 0
#3 88 111 0
#4 114 120 0
#5 121 139 121
#6 221 270 221
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论