我如何高效地合并这些具有范围值的数据框?

huangapple go评论60阅读模式
英文:

How can I efficiently merge these dataframes on range values?

问题

我有两个数据框:

section_headers =
   start_sect_  end_sect_
0            0         50
1          121        139
2          221        270

sentences =
    start_sent_  end_sent_
0             0         50
1            56         76
2            77         85
3            88        111
4           114        120
5           121        139
6           221        270

我试图合并属于每个section_header下的sentences...

一句话属于一个section_header当且仅当它的start_sent_大于等于section_headerstart_sect_并且小于等于下一个section_headerstart_sect_,以此类推。

根据此条件,我的期望输出是:

merge =
        start_sent_  end_sent_  start_sect_
0             0         50           0
1            56         76           0
2            77         85           0
3            88        111           0
4           114        120           0
5           121        139         121
6           221        270         221

我最初将其转换为字典,然后基于条件创建了一个新的数据框,但我处理的数据量非常大,遍历记录需要很长时间。

我正在尝试想出一种方法,以避免不必要地遍历这些记录来合并数据。我尝试了这里的广播方法 Solution 2: Numpy Solution for large dataset,但由于此方法不允许对数组进行索引,所以无法使用。否则,它对我有两个其他合并用例的情况非常有效。

英文:

I have two dataframes:

section_headers =
   start_sect_  end_sect_
0            0         50
1          121        139
2          221        270


sentences =
    start_sent_  end_sent_
0             0         50
1            56         76
2            77         85
3            88        111
4           114        120
5           121        139
6           221        270

I'm trying to merge sentences that belongs under each section_header...

A sentence belongs under a section_header when its start_sent_ is greater than or equal to that of a section_header's start_sect_ and less than or equal to the next section_header's start_sect_, etc.

Given this, my desired output is:

merge =
        start_sent_  end_sent_     start_sect_
    0             0         50               0
    1            56         76               0
    2            77         85               0
    3            88        111               0
    4           114        120               0
    5           121        139               121
    6           221        270               221 

I initially converted this to a dictionary and then created a new dataframe based on the conditions, but the amount of data I'm dealing with was very large and it took forever to iterate through the records.

I'm trying to devise a way to not have to iterate through these records to do a merge of the data. I tried the broadcast method here Solution 2: Numpy Solution for large dataset, but since this method doesn't allow indexing of the arrays, it doesn't work. Otherwise, it works great for two other merge use cases I have.

答案1

得分: 1

这似乎可以使用 merge_asof 函数来实现。

使用 direction="backward",以 section_headers 作为右侧数据框,在 <= 行上进行合并:

pd.merge_asof(sentences, section_headers["start_sect_"],
              left_on="start_sent_", right_on="start_sect_",
              direction="backward")

#Out[]: 
#   start_sent_  end_sent_  start_sect_
#0            0         50            0
#1           56         76            0
#2           77         85            0
#3           88        111            0
#4          114        120            0
#5          121        139          121
#6          221        270          221
英文:

This looks like a use for merge_asof.

Using direction=&quot;backward&quot;, with section_headers as the right DF, the merge will be on the &lt;= to row:

pd.merge_asof(sentences, section_headers[&quot;start_sect_&quot;],
              left_on=&quot;start_sent_&quot;, right_on=&quot;start_sect_&quot;,
              direction=&quot;backward&quot;)

#Out[]: 
#   start_sent_  end_sent_  start_sect_
#0            0         50            0
#1           56         76            0
#2           77         85            0
#3           88        111            0
#4          114        120            0
#5          121        139          121
#6          221        270          221

huangapple
  • 本文由 发表于 2023年5月25日 07:24:44
  • 转载请务必保留本文链接:https://go.coder-hub.com/76327976.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定