如何在Pandas中优化不等式连接?

huangapple go评论111阅读模式
英文:

How to optimise an inequality join in Pandas?

问题

我已经为您翻译好了代码部分:

  1. 我有一个类似于*交叉连接操作*的操作我已经使用for循环实现了它我需要让它变得快速最好还要优雅它根据日期范围条件每天创建一个块条目
  2. 这对于**小数据集**效果很好但对于**大数据集**来说完全陷入了非常慢的运行时我知道它可以向量化我的实现非常糟糕
  3. *我已经看过其他帖子*关于如何在数据框中向量化循环我阅读了[10分钟入门pandas][1]如此帖子[如何在Pandas中迭代行][2]所建议尝试使用lambda函数还尝试了Cython我就是无法搞定它
  4. 我尝试了实现[pandas.MultiIndex.to_frame][3]我有强烈的感觉这或者它的一个类似物是一个不错的选择我也尝试了许多其他方法但都没有成功
  5. 我想要学会优雅地编写代码欢迎提供所有建议解决方案的变化以及评论
  6. ```python
  7. from datetime import datetime
  8. import pandas as pd
  9. beginning = pd.to_datetime('14/09/2021', dayfirst=True)
  10. today = pd.to_datetime(datetime.today())
  11. date_range = pd.date_range(start=beginning, end=today) # .tolist()
  12. frame = pd.DataFrame(columns=['Record_Date', 'Identifier', 'start_date', 'end_date', 'color'])
  13. block = pd.DataFrame(
  14. {'Identifier': ['4913151F', 'F4E9124A', '31715888', 'D0C57FCA', '57B4D7EB', 'E46F1E5D', '99E0A2F8', 'D77E342E',
  15. 'C596D233', 'D0EED63F', 'D0C57FCA'],
  16. 'start_date': ['03/11/2020', '05/07/2022', '22/12/2016', '17/03/2024', '14/10/2022', '08/08/2022', '04/11/2020',
  17. '13/03/2023', '05/11/2021', '12/27/2022', '13/06/2022'],
  18. 'end_date': ['11/07/2023', '11/04/2023', '14/12/2018', '20/01/2025', '15/06/2023', '09/01/2023', '16/07/2022',
  19. '19/05/2024', '24/09/2022', '17/11/2023', '13/06/2023'],
  20. 'color': ['red', 'green', 'magenta', 'yellow', 'light_blue', 'dark_blue', 'black', 'white', 'pink', 'orange',
  21. 'yellow']})
  22. block.start_date = pd.to_datetime(block.start_date, dayfirst=True, format='mixed')
  23. block.end_date = pd.to_datetime(block.end_date, dayfirst=True, format='mixed')
  24. block_uniques = block.drop_duplicates(['Identifier', 'start_date'])
  25. for x in date_range:
  26. temp_df = block_uniques[(block_uniques.start_date <= x) & (block_uniques.end_date >= x)]
  27. temp_df.insert(0, 'Record_Date', x)
  28. frame = pd.concat([frame, temp_df])
  29. frame = frame.sort_values(['Record_Date', 'Identifier'])
  30. frame = frame.reset_index().drop('index', axis=1)
  31. print(frame)

输出和解决方案:

  1. Record_Date Identifier start_date end_date color
  2. 0 2021-09-14 4913151F 2020-11-03 2023-07-11 red
  3. 1 2021-09-14 99E0A2F8 2020-11-04 2022-07-16 black
  4. 2 2021-09-15 4913151F 2020-11-03 2023-07-11 red
  5. 3 2021-09-15 99E0A2F8 2020-11-04 2022-07-16 black
  6. 4 2021-09-16 4913151F 2020-11-03 2023-07-11 red
  7. ... ... ... ... ... ...
  8. 2641 2023-07-05 D0EED63F 2022-12-27 2023-11-17 orange
  9. 2642 2023-07-05 D77E342E 2023-03-13 2024-05-19 white
  10. 2643 2023-07-06 4913151F 2020-11-03 2023-07-11 red
  11. 2644 2023-07-06 D0EED63F 2022-12-27 2023-11-17 orange
  12. 2645 2023-07-06 D77E342E 2023-03-13 2024-05-19 white
  13. [2646 rows x 5 columns]

希望这有助于您的问题。

英文:

I have a cross-join-like operation that I have implemented using a for loop. I need to make it fast and preferably elegant. It creates a block entry per day with a date range condition.

This works fine for small datasets but completely stalls into a very slow runtime for larger datasets. I know that it can be vectorized. My implementation is very bad.

I have looked at the other posts on how to vectorize loops in DataFrames. I read 10 minutes to pandas as per suggested by this post How to iterate over rows in a DataFrame in Pandas, tried using lambda functions. Messed with Cython. I just can't get it.

I tried implementing [pandas.MultiIndex.to_frame] and I have a strong feeling this, or one of it's cousins, is a good way to go. I have also tried a bunch of other things and nothing.

I want to learn to code elegantly. All suggestions, variations on the solution and, comments are welcome.

  1. from datetime import datetime
  2. import pandas as pd
  3. beginning = pd.to_datetime(&#39;14/09/2021&#39;, dayfirst=True)
  4. today = pd.to_datetime(datetime.today())
  5. date_range = pd.date_range(start=beginning, end=today) # .tolist()
  6. frame = pd.DataFrame(columns=[&#39;Record_Date&#39;, &#39;Identifier&#39;, &#39;start_date&#39;, &#39;end_date&#39;, &#39;color&#39;])
  7. block = pd.DataFrame(
  8. {&#39;Identifier&#39;: [&#39;4913151F&#39;, &#39;F4E9124A&#39;, &#39;31715888&#39;, &#39;D0C57FCA&#39;, &#39;57B4D7EB&#39;, &#39;E46F1E5D&#39;, &#39;99E0A2F8&#39;, &#39;D77E342E&#39;,
  9. &#39;C596D233&#39;, &#39;D0EED63F&#39;, &#39;D0C57FCA&#39;],
  10. &#39;start_date&#39;: [&#39;03/11/2020&#39;, &#39;05/07/2022&#39;, &#39;22/12/2016&#39;, &#39;17/03/2024&#39;, &#39;14/10/2022&#39;, &#39;08/08/2022&#39;, &#39;04/11/2020&#39;,
  11. &#39;13/03/2023&#39;, &#39;05/11/2021&#39;, &#39;12/27/2022&#39;, &#39;13/06/2022&#39;],
  12. &#39;end_date&#39;: [&#39;11/07/2023&#39;, &#39;11/04/2023&#39;, &#39;14/12/2018&#39;, &#39;20/01/2025&#39;, &#39;15/06/2023&#39;, &#39;09/01/2023&#39;, &#39;16/07/2022&#39;,
  13. &#39;19/05/2024&#39;, &#39;24/09/2022&#39;, &#39;17/11/2023&#39;, &#39;13/06/2023&#39;],
  14. &#39;color&#39;: [&#39;red&#39;, &#39;green&#39;, &#39;magenta&#39;, &#39;yellow&#39;, &#39;light_blue&#39;, &#39;dark_blue&#39;, &#39;black&#39;, &#39;white&#39;, &#39;pink&#39;, &#39;orange&#39;,
  15. &#39;yellow&#39;]})
  16. block.start_date = pd.to_datetime(block.start_date, dayfirst=True, format=&#39;mixed&#39;)
  17. block.end_date = pd.to_datetime(block.end_date, dayfirst=True, format=&#39;mixed&#39;)
  18. block_uniques = block.drop_duplicates([&#39;Identifier&#39;, &#39;start_date&#39;])
  19. for x in date_range:
  20. temp_df = block_uniques[(block_uniques.start_date &lt;= x) &amp; (block_uniques.end_date &gt;= x)]
  21. temp_df.insert(0, &#39;Record_Date&#39;, x)
  22. frame = pd.concat([frame, temp_df])
  23. frame = frame.sort_values([&#39;Record_Date&#39;, &#39;Identifier&#39;])
  24. frame = frame.reset_index().drop(&#39;index&#39;, axis=1)
  25. print(frame)

Output and solution:

  1. Record_Date Identifier start_date end_date color
  2. 0 2021-09-14 4913151F 2020-11-03 2023-07-11 red
  3. 1 2021-09-14 99E0A2F8 2020-11-04 2022-07-16 black
  4. 2 2021-09-15 4913151F 2020-11-03 2023-07-11 red
  5. 3 2021-09-15 99E0A2F8 2020-11-04 2022-07-16 black
  6. 4 2021-09-16 4913151F 2020-11-03 2023-07-11 red
  7. ... ... ... ... ... ...
  8. 2641 2023-07-05 D0EED63F 2022-12-27 2023-11-17 orange
  9. 2642 2023-07-05 D77E342E 2023-03-13 2024-05-19 white
  10. 2643 2023-07-06 4913151F 2020-11-03 2023-07-11 red
  11. 2644 2023-07-06 D0EED63F 2022-12-27 2023-11-17 orange
  12. 2645 2023-07-06 D77E342E 2023-03-13 2024-05-19 white
  13. [2646 rows x 5 columns]

答案1

得分: 2

看起来像某种形式的不等连接;conditional_join提供了一种高效处理此类情况的方式。请注意,如果您的block中的日期不重叠,那么pd.IntervalIndex是合适且高效的。

请安装开发版本,该版本具有该函数的优化版本:

  1. # pip install pyjanitor
  2. # 安装开发版以获得优化版本
  3. # pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
  4. import janitor
  5. import pandas as pd
  6. # 将date_range转换为命名系列或数据帧
  7. date_range = pd.Series(date_range, name='date')
  8. (block
  9. .conditional_join(
  10. date_range,
  11. # 来自左侧的列,
  12. # 来自右侧的列,
  13. # 操作符
  14. ('start_date', 'date', '<='),
  15. ('end_date', 'date', '>='),
  16. # 在某些情况下,
  17. # numba可能会提供性能提升
  18. use_numba=False,
  19. )
  20. )
  21. Identifier start_date end_date color date
  22. 0 4913151F 2020-11-03 2023-07-11 红色 2021-09-14
  23. 1 4913151F 2020-11-03 2023-07-11 红色 2021-09-15
  24. 2 4913151F 2020-11-03 2023-07-11 红色 2021-09-16
  25. 3 4913151F 2020-11-03 2023-07-11 红色 2021-09-17
  26. 4 4913151F 2020-11-03 2023-07-11 红色 2021-09-18
  27. ... ... ... ... ... ...
  28. 2644 D0C57FCA 2022-06-13 2023-06-13 黄色 2023-06-09
  29. 2645 D0C57FCA 2022-06-13 2023-06-13 黄色 2023-06-10
  30. 2646 D0C57FCA 2022-06-13 2023-06-13 黄色 2023-06-11
  31. 2647 D0C57FCA 2022-06-13 2023-06-13 黄色 2023-06-12
  32. 2648 D0C57FCA 2022-06-13 2023-06-13 黄色 2023-06-13
  33. [2649 rows x 5 columns]
英文:

Looks like some form of inequality join; conditional_join offers an efficient way to handle this.
Note that if your dates in block are not overlapping, then pd.IntervalIndex is suitable and performant.

Kindly install the dev version which has an optimised version of the function:

  1. # pip install pyjanitor
  2. # install the dev for an optimised version
  3. # pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
  4. import janitor
  5. import pandas as pd
  6. # convert date_range to either a named series, or a dataframe
  7. date_range = pd.Series(date_range, name = &#39;date&#39;)
  8. (block
  9. .conditional_join(
  10. date_range,
  11. # column from the left,
  12. # column from the right,
  13. # operator
  14. (&#39;start_date&#39;, &#39;date&#39;, &#39;&lt;=&#39;),
  15. (&#39;end_date&#39;, &#39;date&#39;, &#39;&gt;=&#39;),
  16. # in some scenarios,
  17. # numba might offer a perf boost
  18. use_numba=False,
  19. )
  20. )
  21. Identifier start_date end_date color date
  22. 0 4913151F 2020-11-03 2023-07-11 red 2021-09-14
  23. 1 4913151F 2020-11-03 2023-07-11 red 2021-09-15
  24. 2 4913151F 2020-11-03 2023-07-11 red 2021-09-16
  25. 3 4913151F 2020-11-03 2023-07-11 red 2021-09-17
  26. 4 4913151F 2020-11-03 2023-07-11 red 2021-09-18
  27. ... ... ... ... ... ...
  28. 2644 D0C57FCA 2022-06-13 2023-06-13 yellow 2023-06-09
  29. 2645 D0C57FCA 2022-06-13 2023-06-13 yellow 2023-06-10
  30. 2646 D0C57FCA 2022-06-13 2023-06-13 yellow 2023-06-11
  31. 2647 D0C57FCA 2022-06-13 2023-06-13 yellow 2023-06-12
  32. 2648 D0C57FCA 2022-06-13 2023-06-13 yellow 2023-06-13
  33. [2649 rows x 5 columns]

huangapple
  • 本文由 发表于 2023年7月7日 06:01:22
  • 转载请务必保留本文链接:https://go.coder-hub.com/76632765.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定