我如何高效地合并这些具有范围值的数据框?

huangapple go评论93阅读模式
英文:

How can I efficiently merge these dataframes on range values?

问题

我有两个数据框:

  1. section_headers =
  2. start_sect_ end_sect_
  3. 0 0 50
  4. 1 121 139
  5. 2 221 270
  6. sentences =
  7. start_sent_ end_sent_
  8. 0 0 50
  9. 1 56 76
  10. 2 77 85
  11. 3 88 111
  12. 4 114 120
  13. 5 121 139
  14. 6 221 270

我试图合并属于每个section_header下的sentences...

一句话属于一个section_header当且仅当它的start_sent_大于等于section_headerstart_sect_并且小于等于下一个section_headerstart_sect_,以此类推。

根据此条件,我的期望输出是:

  1. merge =
  2. start_sent_ end_sent_ start_sect_
  3. 0 0 50 0
  4. 1 56 76 0
  5. 2 77 85 0
  6. 3 88 111 0
  7. 4 114 120 0
  8. 5 121 139 121
  9. 6 221 270 221

我最初将其转换为字典,然后基于条件创建了一个新的数据框,但我处理的数据量非常大,遍历记录需要很长时间。

我正在尝试想出一种方法,以避免不必要地遍历这些记录来合并数据。我尝试了这里的广播方法 Solution 2: Numpy Solution for large dataset,但由于此方法不允许对数组进行索引,所以无法使用。否则,它对我有两个其他合并用例的情况非常有效。

英文:

I have two dataframes:

  1. section_headers =
  2. start_sect_ end_sect_
  3. 0 0 50
  4. 1 121 139
  5. 2 221 270
  6. sentences =
  7. start_sent_ end_sent_
  8. 0 0 50
  9. 1 56 76
  10. 2 77 85
  11. 3 88 111
  12. 4 114 120
  13. 5 121 139
  14. 6 221 270

I'm trying to merge sentences that belongs under each section_header...

A sentence belongs under a section_header when its start_sent_ is greater than or equal to that of a section_header's start_sect_ and less than or equal to the next section_header's start_sect_, etc.

Given this, my desired output is:

  1. merge =
  2. start_sent_ end_sent_ start_sect_
  3. 0 0 50 0
  4. 1 56 76 0
  5. 2 77 85 0
  6. 3 88 111 0
  7. 4 114 120 0
  8. 5 121 139 121
  9. 6 221 270 221

I initially converted this to a dictionary and then created a new dataframe based on the conditions, but the amount of data I'm dealing with was very large and it took forever to iterate through the records.

I'm trying to devise a way to not have to iterate through these records to do a merge of the data. I tried the broadcast method here Solution 2: Numpy Solution for large dataset, but since this method doesn't allow indexing of the arrays, it doesn't work. Otherwise, it works great for two other merge use cases I have.

答案1

得分: 1

这似乎可以使用 merge_asof 函数来实现。

使用 direction="backward",以 section_headers 作为右侧数据框,在 <= 行上进行合并:

  1. pd.merge_asof(sentences, section_headers["start_sect_"],
  2. left_on="start_sent_", right_on="start_sect_",
  3. direction="backward")
  4. #Out[]:
  5. # start_sent_ end_sent_ start_sect_
  6. #0 0 50 0
  7. #1 56 76 0
  8. #2 77 85 0
  9. #3 88 111 0
  10. #4 114 120 0
  11. #5 121 139 121
  12. #6 221 270 221
英文:

This looks like a use for merge_asof.

Using direction=&quot;backward&quot;, with section_headers as the right DF, the merge will be on the &lt;= to row:

  1. pd.merge_asof(sentences, section_headers[&quot;start_sect_&quot;],
  2. left_on=&quot;start_sent_&quot;, right_on=&quot;start_sect_&quot;,
  3. direction=&quot;backward&quot;)
  4. #Out[]:
  5. # start_sent_ end_sent_ start_sect_
  6. #0 0 50 0
  7. #1 56 76 0
  8. #2 77 85 0
  9. #3 88 111 0
  10. #4 114 120 0
  11. #5 121 139 121
  12. #6 221 270 221

huangapple
  • 本文由 发表于 2023年5月25日 07:24:44
  • 转载请务必保留本文链接:https://go.coder-hub.com/76327976.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定