Pandas merge on date range on multiple values
Question
I have a dataframe of events with start and end dates like this:
import pandas as pd
from datetime import datetime
df1 = pd.DataFrame({"Event": ["S1", "K1", "S2", "S3", "A1"],
"Start": [datetime(2022,1,4), datetime(2022,1,15), datetime(2022,9,12), datetime(2022,11,11), datetime(2022,5,29)],
"End": [datetime(2022,1,19), datetime(2022,1, 29), datetime(2022,9,27), datetime(2022,11,22), datetime(2022,6,15)]
})
Note: The "Event" column may not have unique values.
I have another dataframe which contains all the holidays:
df2 = pd.DataFrame({"Holidays": [datetime(2022,1,1), datetime(2022,1,6), datetime(2022,1,13), ....]})
For every event, I want to know how many holidays fall between the start and end dates, both inclusive. My solution:
df1['holiday_count'] = df1.apply(lambda x: len(set(pd.date_range(x['Start'], x['End'])).intersection(set(df2['Holidays']))), axis=1)
I realize that my solution is quite inefficient for a large df1. Here are a few things which I tried:
- Since it is not an exact match, df1.merge wouldn't help.
- I tried using pd.merge_asof; however, it joins only one row per match, whereas here the start and end period may contain multiple holidays or none at all.
- I tried using pd.IntervalIndex. The issue I faced there was a KeyError for ranges containing no holidays.
- A cross merge followed by a filter is one option, but I think it would have a high memory footprint, which I want to avoid.
- Although I didn't try it, people were suggesting pandas_sql; however, there were comments stating it is a slow method.
These attempts were based on several past Stack Overflow questions, such as:
- https://stackoverflow.com/questions/44367672/best-way-to-join-merge-by-range-in-pandas
- https://stackoverflow.com/questions/46179362/fastest-way-to-merge-pandas-dataframe-on-ranges
- https://stackoverflow.com/questions/30627968/merge-pandas-dataframes-where-one-value-is-between-two-others
- https://stackoverflow.com/questions/46525786/how-to-join-two-dataframes-for-which-column-values-are-within-a-certain-range
Answer 1
Score: 3
You can try this [tag:numpy] approach:
import numpy as np

sdates, edates = df1["Start"].values, df1["End"].values
hdates = df2["Holidays"].values[:, None]
df1["holiday_count"] = np.sum((hdates >= sdates) & (hdates <= edates), axis=0)
# 801 µs ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Output:
print(df1)
Event Start End holiday_count
0 S1 2022-01-04 2022-01-19 2
1 K1 2022-01-15 2022-01-29 0
2 S2 2022-09-12 2022-09-27 0
3 S3 2022-11-11 2022-11-22 0
4 A1 2022-05-29 2022-06-15 0
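To see why the broadcast works: `hdates` has shape (n_holidays, 1), so comparing it against the (n_events,) arrays broadcasts to an (n_holidays, n_events) boolean matrix, and summing over axis=0 yields one count per event. A minimal check of the shapes (with a trimmed-down df1):

```python
import numpy as np
import pandas as pd
from datetime import datetime

df1 = pd.DataFrame({"Event": ["S1", "K1"],
                    "Start": [datetime(2022, 1, 4), datetime(2022, 1, 15)],
                    "End": [datetime(2022, 1, 19), datetime(2022, 1, 29)]})
df2 = pd.DataFrame({"Holidays": [datetime(2022, 1, 1), datetime(2022, 1, 6), datetime(2022, 1, 13)]})

sdates, edates = df1["Start"].values, df1["End"].values   # each shape (2,)
hdates = df2["Holidays"].values[:, None]                  # shape (3, 1)
mask = (hdates >= sdates) & (hdates <= edates)            # broadcasts to (3, 2)
print(mask.shape)           # (3, 2)
print(mask.sum(axis=0))     # [2 0]
```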
Answer 2
Score: 0
I would also suggest this alternative solution to your question:
# Function to count the number of holidays between two dates (inclusive)
def count_holidays(start_date, end_date, holidays):
    return sum(start_date <= holiday <= end_date for holiday in holidays)

# Count holidays for each event against the full holiday list in df2
df1["HolidaysCount"] = df1.apply(
    lambda row: count_holidays(row["Start"], row["End"], df2["Holidays"]), axis=1
)
print(df1)
Note that this still iterates row by row, so it scales as O(len(df1) * len(df2)).
Answer 3
Score: 0
This is an inequality join, which is solved efficiently with conditional_join:
# pip install pyjanitor
# for better performance, if you can
# install the dev version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
(df1
.conditional_join(
df2,
('Start', 'Holidays', '<='),
('End', 'Holidays', '>='),
how='left')
.groupby(df1.columns.tolist(), sort=False)
.count()
)
Holidays
Event Start End
S1 2022-01-04 2022-01-19 2
K1 2022-01-15 2022-01-29 0
S2 2022-09-12 2022-09-27 0
S3 2022-11-11 2022-11-22 0
A1 2022-05-29 2022-06-15 0
Under the hood, it uses binary search instead of a cartesian join, which offers better performance and efficiency on large data.
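The same binary-search idea can also be sketched with plain numpy (a sketch of mine, not how conditional_join is implemented internally): sort the holidays once, then two searchsorted calls give, per event, how many holidays land in [Start, End] inclusive.

```python
import numpy as np
import pandas as pd
from datetime import datetime

df1 = pd.DataFrame({"Event": ["S1", "K1", "S2", "S3", "A1"],
                    "Start": [datetime(2022, 1, 4), datetime(2022, 1, 15), datetime(2022, 9, 12),
                              datetime(2022, 11, 11), datetime(2022, 5, 29)],
                    "End": [datetime(2022, 1, 19), datetime(2022, 1, 29), datetime(2022, 9, 27),
                            datetime(2022, 11, 22), datetime(2022, 6, 15)]})
df2 = pd.DataFrame({"Holidays": [datetime(2022, 1, 1), datetime(2022, 1, 6), datetime(2022, 1, 13)]})

hol = np.sort(df2["Holidays"].values)
# Index of the first holiday >= Start, and one past the last holiday <= End
left = np.searchsorted(hol, df1["Start"].values, side="left")
right = np.searchsorted(hol, df1["End"].values, side="right")
df1["holiday_count"] = right - left
print(df1["holiday_count"].tolist())  # [2, 0, 0, 0, 0]
```

After the one-off sort this is O((n + m) log m) and never materializes the cross product.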