如何在Pandas中优化不等式连接?

huangapple go评论75阅读模式
英文:

How to optimise an inequality join in Pandas?

问题

我已经为您翻译好了代码部分:

我有一个类似于*交叉连接操作*的操作我已经使用for循环实现了它我需要让它变得快速最好还要优雅它根据日期范围条件每天创建一个块条目

这对于**小数据集**效果很好但对于**大数据集**来说完全陷入了非常慢的运行时我知道它可以向量化我的实现非常糟糕

*我已经看过其他帖子*关于如何在数据框中向量化循环我阅读了[10分钟入门pandas][1]如此帖子[如何在Pandas中迭代行][2]所建议尝试使用lambda函数还尝试了Cython我就是无法搞定它

我尝试了实现[pandas.MultiIndex.to_frame][3]我有强烈的感觉这或者它的一个类似物是一个不错的选择我也尝试了许多其他方法但都没有成功

我想要学会优雅地编写代码欢迎提供所有建议解决方案的变化以及评论

```python
from datetime import datetime
import pandas as pd 

beginning = pd.to_datetime('14/09/2021', dayfirst=True)
today = pd.to_datetime(datetime.today())
date_range = pd.date_range(start=beginning, end=today)  # .tolist()

frame = pd.DataFrame(columns=['Record_Date', 'Identifier', 'start_date', 'end_date', 'color'])
block = pd.DataFrame(
    {'Identifier': ['4913151F', 'F4E9124A', '31715888', 'D0C57FCA', '57B4D7EB', 'E46F1E5D', '99E0A2F8', 'D77E342E',
                    'C596D233', 'D0EED63F', 'D0C57FCA'],
     'start_date': ['03/11/2020', '05/07/2022', '22/12/2016', '17/03/2024', '14/10/2022', '08/08/2022', '04/11/2020',
                    '13/03/2023', '05/11/2021', '12/27/2022', '13/06/2022'],
     'end_date': ['11/07/2023', '11/04/2023', '14/12/2018', '20/01/2025', '15/06/2023', '09/01/2023', '16/07/2022',
                  '19/05/2024', '24/09/2022', '17/11/2023', '13/06/2023'],
     'color': ['red', 'green', 'magenta', 'yellow', 'light_blue', 'dark_blue', 'black', 'white', 'pink', 'orange',
               'yellow']})

block.start_date = pd.to_datetime(block.start_date, dayfirst=True, format='mixed')
block.end_date = pd.to_datetime(block.end_date, dayfirst=True, format='mixed')
block_uniques = block.drop_duplicates(['Identifier', 'start_date'])

for x in date_range:
    temp_df = block_uniques[(block_uniques.start_date <= x) & (block_uniques.end_date >= x)] 
    temp_df.insert(0, 'Record_Date', x)
    frame = pd.concat([frame, temp_df])
frame = frame.sort_values(['Record_Date', 'Identifier'])
frame = frame.reset_index().drop('index', axis=1)

print(frame)

输出和解决方案:

     Record_Date Identifier start_date   end_date   color
0     2021-09-14   4913151F 2020-11-03 2023-07-11     red
1     2021-09-14   99E0A2F8 2020-11-04 2022-07-16   black
2     2021-09-15   4913151F 2020-11-03 2023-07-11     red
3     2021-09-15   99E0A2F8 2020-11-04 2022-07-16   black
4     2021-09-16   4913151F 2020-11-03 2023-07-11     red
...          ...        ...        ...        ...     ...
2641  2023-07-05   D0EED63F 2022-12-27 2023-11-17  orange
2642  2023-07-05   D77E342E 2023-03-13 2024-05-19   white
2643  2023-07-06   4913151F 2020-11-03 2023-07-11     red
2644  2023-07-06   D0EED63F 2022-12-27 2023-11-17  orange
2645  2023-07-06   D77E342E 2023-03-13 2024-05-19   white

[2646 rows x 5 columns]

希望这有助于您的问题。

英文:

I have a cross-join-like operation that I have implemented using a for loop. I need to make it fast and preferably elegant. It creates a block entry per day with a date range condition.

This works fine for small datasets but completely stalls into a very slow runtime for larger datasets. I know that it can be vectorized. My implementation is very bad.

I have looked at the other posts on how to vectorize loops in DataFrames. I read 10 minutes to pandas as per suggested by this post How to iterate over rows in a DataFrame in Pandas, tried using lambda functions. Messed with Cython. I just can't get it.

I tried implementing [pandas.MultiIndex.to_frame] and I have a strong feeling this, or one of it's cousins, is a good way to go. I have also tried a bunch of other things and nothing.

I want to learn to code elegantly. All suggestions, variations on the solution and, comments are welcome.

from datetime import datetime
import pandas as pd 

beginning = pd.to_datetime(&#39;14/09/2021&#39;, dayfirst=True)
today = pd.to_datetime(datetime.today())
date_range = pd.date_range(start=beginning, end=today)  # .tolist()

frame = pd.DataFrame(columns=[&#39;Record_Date&#39;, &#39;Identifier&#39;, &#39;start_date&#39;, &#39;end_date&#39;, &#39;color&#39;])
block = pd.DataFrame(
    {&#39;Identifier&#39;: [&#39;4913151F&#39;, &#39;F4E9124A&#39;, &#39;31715888&#39;, &#39;D0C57FCA&#39;, &#39;57B4D7EB&#39;, &#39;E46F1E5D&#39;, &#39;99E0A2F8&#39;, &#39;D77E342E&#39;,
                    &#39;C596D233&#39;, &#39;D0EED63F&#39;, &#39;D0C57FCA&#39;],
     &#39;start_date&#39;: [&#39;03/11/2020&#39;, &#39;05/07/2022&#39;, &#39;22/12/2016&#39;, &#39;17/03/2024&#39;, &#39;14/10/2022&#39;, &#39;08/08/2022&#39;, &#39;04/11/2020&#39;,
                    &#39;13/03/2023&#39;, &#39;05/11/2021&#39;, &#39;12/27/2022&#39;, &#39;13/06/2022&#39;],
     &#39;end_date&#39;: [&#39;11/07/2023&#39;, &#39;11/04/2023&#39;, &#39;14/12/2018&#39;, &#39;20/01/2025&#39;, &#39;15/06/2023&#39;, &#39;09/01/2023&#39;, &#39;16/07/2022&#39;,
                  &#39;19/05/2024&#39;, &#39;24/09/2022&#39;, &#39;17/11/2023&#39;, &#39;13/06/2023&#39;],
     &#39;color&#39;: [&#39;red&#39;, &#39;green&#39;, &#39;magenta&#39;, &#39;yellow&#39;, &#39;light_blue&#39;, &#39;dark_blue&#39;, &#39;black&#39;, &#39;white&#39;, &#39;pink&#39;, &#39;orange&#39;,
               &#39;yellow&#39;]})

block.start_date = pd.to_datetime(block.start_date, dayfirst=True, format=&#39;mixed&#39;)
block.end_date = pd.to_datetime(block.end_date, dayfirst=True, format=&#39;mixed&#39;)
block_uniques = block.drop_duplicates([&#39;Identifier&#39;, &#39;start_date&#39;])

for x in date_range:
    temp_df = block_uniques[(block_uniques.start_date &lt;= x) &amp; (block_uniques.end_date &gt;= x)] 
    temp_df.insert(0, &#39;Record_Date&#39;, x)
    frame = pd.concat([frame, temp_df])
frame = frame.sort_values([&#39;Record_Date&#39;, &#39;Identifier&#39;])
frame = frame.reset_index().drop(&#39;index&#39;, axis=1)

print(frame)

Output and solution:

     Record_Date Identifier start_date   end_date   color
0     2021-09-14   4913151F 2020-11-03 2023-07-11     red
1     2021-09-14   99E0A2F8 2020-11-04 2022-07-16   black
2     2021-09-15   4913151F 2020-11-03 2023-07-11     red
3     2021-09-15   99E0A2F8 2020-11-04 2022-07-16   black
4     2021-09-16   4913151F 2020-11-03 2023-07-11     red
...          ...        ...        ...        ...     ...
2641  2023-07-05   D0EED63F 2022-12-27 2023-11-17  orange
2642  2023-07-05   D77E342E 2023-03-13 2024-05-19   white
2643  2023-07-06   4913151F 2020-11-03 2023-07-11     red
2644  2023-07-06   D0EED63F 2022-12-27 2023-11-17  orange
2645  2023-07-06   D77E342E 2023-03-13 2024-05-19   white
[2646 rows x 5 columns]

答案1

得分: 2

看起来像某种形式的不等连接;conditional_join提供了一种高效处理此类情况的方式。请注意,如果您的block中的日期不重叠,那么pd.IntervalIndex是合适且高效的。

请安装开发版本,该版本具有该函数的优化版本:

# pip install pyjanitor
# 安装开发版以获得优化版本
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd

# 将date_range转换为命名系列或数据帧
date_range = pd.Series(date_range, name='date')

(block
.conditional_join(
    date_range,
    # 来自左侧的列,
    # 来自右侧的列,
    # 操作符
    ('start_date', 'date', '<='),
    ('end_date', 'date', '>='),
    # 在某些情况下,
    # numba可能会提供性能提升
    use_numba=False,
   )
)
     Identifier start_date   end_date   color       date
0      4913151F 2020-11-03 2023-07-11     红色 2021-09-14
1      4913151F 2020-11-03 2023-07-11     红色 2021-09-15
2      4913151F 2020-11-03 2023-07-11     红色 2021-09-16
3      4913151F 2020-11-03 2023-07-11     红色 2021-09-17
4      4913151F 2020-11-03 2023-07-11     红色 2021-09-18
...         ...        ...        ...     ...        ...
2644   D0C57FCA 2022-06-13 2023-06-13     黄色 2023-06-09
2645   D0C57FCA 2022-06-13 2023-06-13     黄色 2023-06-10
2646   D0C57FCA 2022-06-13 2023-06-13     黄色 2023-06-11
2647   D0C57FCA 2022-06-13 2023-06-13     黄色 2023-06-12
2648   D0C57FCA 2022-06-13 2023-06-13     黄色 2023-06-13

[2649 rows x 5 columns]
英文:

Looks like some form of inequality join; conditional_join offers an efficient way to handle this.
Note that if your dates in block are not overlapping, then pd.IntervalIndex is suitable and performant.

Kindly install the dev version which has an optimised version of the function:

# pip install pyjanitor
# install the dev for an optimised version
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd

# convert date_range to either a named series, or a dataframe
date_range = pd.Series(date_range, name = &#39;date&#39;)

(block
.conditional_join(
    date_range, 
    # column from the left,
    # column from the right,
    # operator
    (&#39;start_date&#39;, &#39;date&#39;, &#39;&lt;=&#39;), 
    (&#39;end_date&#39;, &#39;date&#39;, &#39;&gt;=&#39;),
    # in some scenarios,
    # numba might offer a perf boost
    use_numba=False,
   )
)
     Identifier start_date   end_date   color       date
0      4913151F 2020-11-03 2023-07-11     red 2021-09-14
1      4913151F 2020-11-03 2023-07-11     red 2021-09-15
2      4913151F 2020-11-03 2023-07-11     red 2021-09-16
3      4913151F 2020-11-03 2023-07-11     red 2021-09-17
4      4913151F 2020-11-03 2023-07-11     red 2021-09-18
...         ...        ...        ...     ...        ...
2644   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-09
2645   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-10
2646   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-11
2647   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-12
2648   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-13

[2649 rows x 5 columns]

huangapple
  • 本文由 发表于 2023年7月7日 06:01:22
  • 转载请务必保留本文链接:https://go.coder-hub.com/76632765.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定