2023年7月7日 06:01:22go评论111阅读模式

英文:

How to optimise an inequality join in Pandas?

问题

我已经为您翻译好了代码部分：

我有一个类似于*交叉连接操作*的操作，我已经使用for循环实现了它。我需要让它变得快速，最好还要优雅。它根据日期范围条件每天创建一个块条目。
这对于**小数据集**效果很好，但对于**大数据集**来说，完全陷入了非常慢的运行时。我知道它可以向量化。我的实现非常糟糕。
*我已经看过其他帖子*，关于如何在数据框中向量化循环。我阅读了[10分钟入门pandas][1]，如此帖子[如何在Pandas中迭代行][2]所建议，尝试使用lambda函数。还尝试了Cython。我就是无法搞定它。
我尝试了实现[pandas.MultiIndex.to_frame][3]，我有强烈的感觉，这或者它的一个类似物是一个不错的选择。我也尝试了许多其他方法，但都没有成功。
我想要学会优雅地编写代码。欢迎提供所有建议、解决方案的变化以及评论。
```python
from datetime import datetime
import pandas as pd 
beginning = pd.to_datetime('14/09/2021', dayfirst=True)
today = pd.to_datetime(datetime.today())
date_range = pd.date_range(start=beginning, end=today)  # .tolist()
frame = pd.DataFrame(columns=['Record_Date', 'Identifier', 'start_date', 'end_date', 'color'])
block = pd.DataFrame(
    {'Identifier': ['4913151F', 'F4E9124A', '31715888', 'D0C57FCA', '57B4D7EB', 'E46F1E5D', '99E0A2F8', 'D77E342E',
                    'C596D233', 'D0EED63F', 'D0C57FCA'],
     'start_date': ['03/11/2020', '05/07/2022', '22/12/2016', '17/03/2024', '14/10/2022', '08/08/2022', '04/11/2020',
                    '13/03/2023', '05/11/2021', '12/27/2022', '13/06/2022'],
     'end_date': ['11/07/2023', '11/04/2023', '14/12/2018', '20/01/2025', '15/06/2023', '09/01/2023', '16/07/2022',
                  '19/05/2024', '24/09/2022', '17/11/2023', '13/06/2023'],
     'color': ['red', 'green', 'magenta', 'yellow', 'light_blue', 'dark_blue', 'black', 'white', 'pink', 'orange',
               'yellow']})
block.start_date = pd.to_datetime(block.start_date, dayfirst=True, format='mixed')
block.end_date = pd.to_datetime(block.end_date, dayfirst=True, format='mixed')
block_uniques = block.drop_duplicates(['Identifier', 'start_date'])
for x in date_range:
    temp_df = block_uniques[(block_uniques.start_date <= x) & (block_uniques.end_date >= x)] 
    temp_df.insert(0, 'Record_Date', x)
    frame = pd.concat([frame, temp_df])
frame = frame.sort_values(['Record_Date', 'Identifier'])
frame = frame.reset_index().drop('index', axis=1)
print(frame)

输出和解决方案：

     Record_Date Identifier start_date   end_date   color
0     2021-09-14   4913151F 2020-11-03 2023-07-11     red
1     2021-09-14   99E0A2F8 2020-11-04 2022-07-16   black
2     2021-09-15   4913151F 2020-11-03 2023-07-11     red
3     2021-09-15   99E0A2F8 2020-11-04 2022-07-16   black
4     2021-09-16   4913151F 2020-11-03 2023-07-11     red
...          ...        ...        ...        ...     ...
2641  2023-07-05   D0EED63F 2022-12-27 2023-11-17  orange
2642  2023-07-05   D77E342E 2023-03-13 2024-05-19   white
2643  2023-07-06   4913151F 2020-11-03 2023-07-11     red
2644  2023-07-06   D0EED63F 2022-12-27 2023-11-17  orange
2645  2023-07-06   D77E342E 2023-03-13 2024-05-19   white
[2646 rows x 5 columns]

希望这有助于您的问题。

英文:

I have a cross-join-like operation that I have implemented using a for loop. I need to make it fast and preferably elegant. It creates a block entry per day with a date range condition.

This works fine for small datasets but completely stalls into a very slow runtime for larger datasets. I know that it can be vectorized. My implementation is very bad.

I have looked at the other posts on how to vectorize loops in DataFrames. I read 10 minutes to pandas as per suggested by this post How to iterate over rows in a DataFrame in Pandas, tried using lambda functions. Messed with Cython. I just can't get it.

I tried implementing [pandas.MultiIndex.to_frame] and I have a strong feeling this, or one of it's cousins, is a good way to go. I have also tried a bunch of other things and nothing.

I want to learn to code elegantly. All suggestions, variations on the solution and, comments are welcome.

from datetime import datetime
import pandas as pd 
beginning = pd.to_datetime(&#39;14/09/2021&#39;, dayfirst=True)
today = pd.to_datetime(datetime.today())
date_range = pd.date_range(start=beginning, end=today)  # .tolist()
frame = pd.DataFrame(columns=[&#39;Record_Date&#39;, &#39;Identifier&#39;, &#39;start_date&#39;, &#39;end_date&#39;, &#39;color&#39;])
block = pd.DataFrame(
    {&#39;Identifier&#39;: [&#39;4913151F&#39;, &#39;F4E9124A&#39;, &#39;31715888&#39;, &#39;D0C57FCA&#39;, &#39;57B4D7EB&#39;, &#39;E46F1E5D&#39;, &#39;99E0A2F8&#39;, &#39;D77E342E&#39;,
                    &#39;C596D233&#39;, &#39;D0EED63F&#39;, &#39;D0C57FCA&#39;],
     &#39;start_date&#39;: [&#39;03/11/2020&#39;, &#39;05/07/2022&#39;, &#39;22/12/2016&#39;, &#39;17/03/2024&#39;, &#39;14/10/2022&#39;, &#39;08/08/2022&#39;, &#39;04/11/2020&#39;,
                    &#39;13/03/2023&#39;, &#39;05/11/2021&#39;, &#39;12/27/2022&#39;, &#39;13/06/2022&#39;],
     &#39;end_date&#39;: [&#39;11/07/2023&#39;, &#39;11/04/2023&#39;, &#39;14/12/2018&#39;, &#39;20/01/2025&#39;, &#39;15/06/2023&#39;, &#39;09/01/2023&#39;, &#39;16/07/2022&#39;,
                  &#39;19/05/2024&#39;, &#39;24/09/2022&#39;, &#39;17/11/2023&#39;, &#39;13/06/2023&#39;],
     &#39;color&#39;: [&#39;red&#39;, &#39;green&#39;, &#39;magenta&#39;, &#39;yellow&#39;, &#39;light_blue&#39;, &#39;dark_blue&#39;, &#39;black&#39;, &#39;white&#39;, &#39;pink&#39;, &#39;orange&#39;,
               &#39;yellow&#39;]})
block.start_date = pd.to_datetime(block.start_date, dayfirst=True, format=&#39;mixed&#39;)
block.end_date = pd.to_datetime(block.end_date, dayfirst=True, format=&#39;mixed&#39;)
block_uniques = block.drop_duplicates([&#39;Identifier&#39;, &#39;start_date&#39;])
for x in date_range:
    temp_df = block_uniques[(block_uniques.start_date &lt;= x) &amp; (block_uniques.end_date &gt;= x)] 
    temp_df.insert(0, &#39;Record_Date&#39;, x)
    frame = pd.concat([frame, temp_df])
frame = frame.sort_values([&#39;Record_Date&#39;, &#39;Identifier&#39;])
frame = frame.reset_index().drop(&#39;index&#39;, axis=1)
print(frame)

Output and solution:

     Record_Date Identifier start_date   end_date   color
0     2021-09-14   4913151F 2020-11-03 2023-07-11     red
1     2021-09-14   99E0A2F8 2020-11-04 2022-07-16   black
2     2021-09-15   4913151F 2020-11-03 2023-07-11     red
3     2021-09-15   99E0A2F8 2020-11-04 2022-07-16   black
4     2021-09-16   4913151F 2020-11-03 2023-07-11     red
...          ...        ...        ...        ...     ...
2641  2023-07-05   D0EED63F 2022-12-27 2023-11-17  orange
2642  2023-07-05   D77E342E 2023-03-13 2024-05-19   white
2643  2023-07-06   4913151F 2020-11-03 2023-07-11     red
2644  2023-07-06   D0EED63F 2022-12-27 2023-11-17  orange
2645  2023-07-06   D77E342E 2023-03-13 2024-05-19   white
[2646 rows x 5 columns]

答案1

得分: 2

看起来像某种形式的不等连接；conditional_join提供了一种高效处理此类情况的方式。请注意，如果您的block中的日期不重叠，那么pd.IntervalIndex是合适且高效的。

请安装开发版本，该版本具有该函数的优化版本：

# pip install pyjanitor
# 安装开发版以获得优化版本
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd
# 将date_range转换为命名系列或数据帧
date_range = pd.Series(date_range, name='date')
(block
.conditional_join(
    date_range,
    # 来自左侧的列，
    # 来自右侧的列，
    # 操作符
    ('start_date', 'date', '<='),
    ('end_date', 'date', '>='),
    # 在某些情况下，
    # numba可能会提供性能提升
    use_numba=False,
   )
)
     Identifier start_date   end_date   color       date
0      4913151F 2020-11-03 2023-07-11     红色 2021-09-14
1      4913151F 2020-11-03 2023-07-11     红色 2021-09-15
2      4913151F 2020-11-03 2023-07-11     红色 2021-09-16
3      4913151F 2020-11-03 2023-07-11     红色 2021-09-17
4      4913151F 2020-11-03 2023-07-11     红色 2021-09-18
...         ...        ...        ...     ...        ...
2644   D0C57FCA 2022-06-13 2023-06-13     黄色 2023-06-09
2645   D0C57FCA 2022-06-13 2023-06-13     黄色 2023-06-10
2646   D0C57FCA 2022-06-13 2023-06-13     黄色 2023-06-11
2647   D0C57FCA 2022-06-13 2023-06-13     黄色 2023-06-12
2648   D0C57FCA 2022-06-13 2023-06-13     黄色 2023-06-13
[2649 rows x 5 columns]

英文:

Looks like some form of inequality join; conditional_join offers an efficient way to handle this.
Note that if your dates in block are not overlapping, then pd.IntervalIndex is suitable and performant.

Kindly install the dev version which has an optimised version of the function:

# pip install pyjanitor
# install the dev for an optimised version
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd
# convert date_range to either a named series, or a dataframe
date_range = pd.Series(date_range, name = &#39;date&#39;)
(block
.conditional_join(
    date_range, 
    # column from the left,
    # column from the right,
    # operator
    (&#39;start_date&#39;, &#39;date&#39;, &#39;&lt;=&#39;), 
    (&#39;end_date&#39;, &#39;date&#39;, &#39;&gt;=&#39;),
    # in some scenarios,
    # numba might offer a perf boost
    use_numba=False,
   )
)
     Identifier start_date   end_date   color       date
0      4913151F 2020-11-03 2023-07-11     red 2021-09-14
1      4913151F 2020-11-03 2023-07-11     red 2021-09-15
2      4913151F 2020-11-03 2023-07-11     red 2021-09-16
3      4913151F 2020-11-03 2023-07-11     red 2021-09-17
4      4913151F 2020-11-03 2023-07-11     red 2021-09-18
...         ...        ...        ...     ...        ...
2644   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-09
2645   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-10
2646   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-11
2647   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-12
2648   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-13
[2649 rows x 5 columns]

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何在Pandas中优化不等式连接？

问题

答案1

Testing a file download using Selenium – python.

如何获取我下载目录中特定文件的最新版本？

嵌套列表中的元素重复不同次数，为什么更智能的方法更慢？

Running a script after creation of instance in GCP Managed Instance Group

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。