2023-05-25 07:24:44

How can I efficiently merge these dataframes on range values?

Question

I have two dataframes:

section_headers =
   start_sect_  end_sect_
0            0         50
1          121        139
2          221        270
sentences =
    start_sent_  end_sent_
0             0         50
1            56         76
2            77         85
3            88        111
4           114        120
5           121        139
6           221        270

I'm trying to merge the sentences that belong under each section_header...

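A sentence belongs under a section_header when its start_sent_ is greater than or equal to that section_header's start_sect_ and less than the next section_header's start_sect_ (a sentence that starts exactly at a header's start_sect_ belongs to that header, as rows 5 and 6 of the desired output show).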

Given this, my desired output is:

merge =
   start_sent_  end_sent_  start_sect_
0            0         50            0
1           56         76            0
2           77         85            0
3           88        111            0
4          114        120            0
5          121        139          121
6          221        270          221

I initially converted this to a dictionary and then built a new dataframe based on the conditions, but the amount of data I'm dealing with is very large and iterating through the records took far too long.

I'm trying to find a way to merge the data without iterating through the records. I tried the broadcast method from "Solution 2: Numpy Solution for large dataset", but since that method doesn't allow indexing of the arrays, it doesn't work here. It does work well for two other merge use cases I have.

Answer 1

Score: 1

This looks like a use for merge_asof.

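With direction="backward" and section_headers as the right DataFrame, each sentence is matched to the last row whose start_sect_ is less than or equal to its start_sent_. Both frames need to be sorted on their key columns: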

pd.merge_asof(sentences, section_headers["start_sect_"],
              left_on="start_sent_", right_on="start_sect_",
              direction="backward")
#Out[]: 
#   start_sent_  end_sent_  start_sect_
#0            0         50            0
#1           56         76            0
#2           77         85            0
#3           88        111            0
#4          114        120            0
#5          121        139          121
#6          221        270          221
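
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the two sample frames from the question, sorts them defensively (merge_asof expects sorted keys), and cross-checks the result with np.searchsorted. The variable names merged, starts, idx and via_numpy are my own, and the searchsorted part is an alternative I am adding for comparison, not something from the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
section_headers = pd.DataFrame(
    {"start_sect_": [0, 121, 221], "end_sect_": [50, 139, 270]}
)
sentences = pd.DataFrame(
    {
        "start_sent_": [0, 56, 77, 88, 114, 121, 221],
        "end_sent_": [50, 76, 85, 111, 120, 139, 270],
    }
)

# merge_asof requires both key columns to be sorted in ascending order.
sentences = sentences.sort_values("start_sent_").reset_index(drop=True)
section_headers = section_headers.sort_values("start_sect_").reset_index(drop=True)

# For each sentence, pick the last header whose start_sect_ <= start_sent_.
merged = pd.merge_asof(
    sentences,
    section_headers[["start_sect_"]],
    left_on="start_sent_",
    right_on="start_sect_",
    direction="backward",
)

# Cross-check with NumPy: position of the last header start <= each sentence start.
starts = section_headers["start_sect_"].to_numpy()
idx = np.searchsorted(starts, sentences["start_sent_"].to_numpy(), side="right") - 1
# Caveat: a sentence starting before the first header would give idx == -1,
# which NumPy would wrap to the last header. The sample data has no such row,
# but real data would need those rows masked out.
via_numpy = starts[idx]

print(merged)
print(via_numpy)
```

On data where a sentence can start before the first header, merge_asof leaves start_sect_ as NaN for that row, which is usually easier to spot than the wrap-around in the NumPy variant.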
