Pandas merge on date range on multiple values
Question
I have a dataframe of events with start and end dates like this:
import pandas as pd
from datetime import datetime
df1 = pd.DataFrame({"Event": ["S1", "K1", "S2", "S3", "A1"],
"Start": [datetime(2022,1,4), datetime(2022,1,15), datetime(2022,9,12), datetime(2022,11,11), datetime(2022,5,29)],
"End": [datetime(2022,1,19), datetime(2022,1, 29), datetime(2022,9,27), datetime(2022,11,22), datetime(2022,6,15)]
})
Note: The "Event" column may not have unique values.
I have another dataframe which contains all the holidays:
df2 = pd.DataFrame({"Holidays": [datetime(2022,1,1), datetime(2022,1,6), datetime(2022,1,13), ....]})
For every event, I want to know how many holidays fall between the start and end dates, both inclusive. My solution:
df1['holiday_count'] = df1.apply(lambda x: len(set(pd.date_range(x['Start'], x['End'])).intersection(set(df2['Holidays']))), axis=1)
I realize that my solution is quite inefficient for a large df1. Here are a few things which I tried:
- Since it is not an exact match, df1.merge wouldn't help.
- I tried using pd.merge_asof; however, it joins only one row per match, whereas here the start and end period may contain multiple holidays or none at all.
- I tried using pd.IntervalIndex. The issue I faced there was a KeyError for ranges containing no holidays.
- A cross merge followed by a filter is one option, but I think it would have a high memory footprint, which I want to avoid.
- Although I didn't try it, people were suggesting pandas_sql; however, there were comments stating it is a slow method.
These attempts were based on several past Stack Overflow questions, such as:
- https://stackoverflow.com/questions/44367672/best-way-to-join-merge-by-range-in-pandas
- https://stackoverflow.com/questions/46179362/fastest-way-to-merge-pandas-dataframe-on-ranges
- https://stackoverflow.com/questions/30627968/merge-pandas-dataframes-where-one-value-is-between-two-others
- https://stackoverflow.com/questions/46525786/how-to-join-two-dataframes-for-which-column-values-are-within-a-certain-range
Answer 1
Score: 3
You can try this [tag:numpy] approach:
import numpy as np

sdates, edates = df1["Start"].values, df1["End"].values
hdates = df2["Holidays"].values[:, None]
df1["holiday_count"] = np.sum((hdates >= sdates) & (hdates <= edates), axis=0)
# 801 µs ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Output:
print(df1)
Event Start End holiday_count
0 S1 2022-01-04 2022-01-19 2
1 K1 2022-01-15 2022-01-29 0
2 S2 2022-09-12 2022-09-27 0
3 S3 2022-11-11 2022-11-22 0
4 A1 2022-05-29 2022-06-15 0
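To see why the broadcast works: `hdates` has shape (n_holidays, 1), so comparing it against the (n_events,) arrays broadcasts to an (n_holidays, n_events) boolean matrix, and summing over axis=0 yields one count per event. A minimal check of the shapes (with a trimmed-down df1):

```python
import numpy as np
import pandas as pd
from datetime import datetime

df1 = pd.DataFrame({"Event": ["S1", "K1"],
                    "Start": [datetime(2022, 1, 4), datetime(2022, 1, 15)],
                    "End": [datetime(2022, 1, 19), datetime(2022, 1, 29)]})
df2 = pd.DataFrame({"Holidays": [datetime(2022, 1, 1), datetime(2022, 1, 6), datetime(2022, 1, 13)]})

sdates, edates = df1["Start"].values, df1["End"].values   # each shape (2,)
hdates = df2["Holidays"].values[:, None]                  # shape (3, 1)
mask = (hdates >= sdates) & (hdates <= edates)            # broadcasts to (3, 2)
print(mask.shape)           # (3, 2)
print(mask.sum(axis=0))     # [2 0]
```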
Answer 2
Score: 0
I would also suggest this alternative solution to your question:
# Function to count the number of holidays between two dates (inclusive)
def count_holidays(start_date, end_date, holidays):
    return sum(start_date <= holiday <= end_date for holiday in holidays)

# Count holidays for each event against the full holiday list in df2
df1["HolidaysCount"] = df1.apply(
    lambda row: count_holidays(row["Start"], row["End"], df2["Holidays"]), axis=1
)
print(df1)
Note that this still iterates row by row, so it scales as O(len(df1) * len(df2)).
Answer 3
Score: 0
This is an inequality join, which is solved efficiently with conditional_join:
# pip install pyjanitor
# for better performance, if you can
# install the dev version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
(df1
.conditional_join(
df2,
('Start', 'Holidays', '<='),
('End', 'Holidays', '>='),
how='left')
.groupby(df1.columns.tolist(), sort=False)
.count()
)
Holidays
Event Start End
S1 2022-01-04 2022-01-19 2
K1 2022-01-15 2022-01-29 0
S2 2022-09-12 2022-09-27 0
S3 2022-11-11 2022-11-22 0
A1 2022-05-29 2022-06-15 0
Under the hood, it uses binary search instead of a cartesian join, which offers better performance and efficiency on large data.
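The same binary-search idea can also be sketched with plain numpy (a sketch of mine, not how conditional_join is implemented internally): sort the holidays once, then two searchsorted calls give, per event, how many holidays land in [Start, End] inclusive.

```python
import numpy as np
import pandas as pd
from datetime import datetime

df1 = pd.DataFrame({"Event": ["S1", "K1", "S2", "S3", "A1"],
                    "Start": [datetime(2022, 1, 4), datetime(2022, 1, 15), datetime(2022, 9, 12),
                              datetime(2022, 11, 11), datetime(2022, 5, 29)],
                    "End": [datetime(2022, 1, 19), datetime(2022, 1, 29), datetime(2022, 9, 27),
                            datetime(2022, 11, 22), datetime(2022, 6, 15)]})
df2 = pd.DataFrame({"Holidays": [datetime(2022, 1, 1), datetime(2022, 1, 6), datetime(2022, 1, 13)]})

hol = np.sort(df2["Holidays"].values)
# Index of the first holiday >= Start, and one past the last holiday <= End
left = np.searchsorted(hol, df1["Start"].values, side="left")
right = np.searchsorted(hol, df1["End"].values, side="right")
df1["holiday_count"] = right - left
print(df1["holiday_count"].tolist())  # [2, 0, 0, 0, 0]
```

After the one-off sort this is O((n + m) log m) and never materializes the cross product.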