Pandas多值上的日期范围合并

huangapple go评论62阅读模式
英文:

Pandas merge on date range on multiple values

问题

我理解你的问题,你想知道如何在两个数据框之间高效地计算在开始和结束日期之间(包括起始和结束日期)有多少个假期。你之前的解决方案效率较低。以下是一种更高效的方法:

首先,你可以使用numpy的向量化操作来计算日期范围内的假期数量,然后将结果添加到df1中。这样可以避免使用apply函数,提高效率。

import pandas as pd
import numpy as np
from datetime import datetime

# 创建示例数据框
df1 = pd.DataFrame({"Event": ["S1", "K1", "S2", "S3", "A1"],
                    "Start": [datetime(2022,1,4), datetime(2022,1,15), datetime(2022,9,12), datetime(2022,11,11), datetime(2022,5,29)],
                    "End": [datetime(2022,1,19), datetime(2022,1, 29), datetime(2022,9,27), datetime(2022,11,22), datetime(2022,6,15)]
                   })

df2 = pd.DataFrame({"Holidays": [datetime(2022,1,1), datetime(2022,1,6), datetime(2022,1,13)]})

# 使用numpy向量化操作计算每个事件的假期数量
df1['holiday_count'] = df1.apply(lambda x: np.sum((x['Start'] <= df2['Holidays']) & (df2['Holidays'] <= x['End'])), axis=1)

# 显示结果
print(df1)

这个方法使用了向量化操作,可以显著提高处理效率,特别是在处理大型数据集时。希望这可以帮助你解决问题。

英文:

I have dataframe of events with start and end dates like this:

import pandas as pd
from datetime import datetime
df1 = pd.DataFrame({&quot;Event&quot;: [&quot;S1&quot;, &quot;K1&quot;, &quot;S2&quot;, &quot;S3&quot;, &quot;A1&quot;],
                    &quot;Start&quot;: [datetime(2022,1,4), datetime(2022,1,15), datetime(2022,9,12), datetime(2022,11,11), datetime(2022,5,29)],
                    &quot;End&quot;: [datetime(2022,1,19), datetime(2022,1, 29), datetime(2022,9,27), datetime(2022,11,22), datetime(2022,6,15)]
                   })

Note: The &quot;Event&quot; column may not have unique values.

I have another dataframe which contains all the holidays:

df2 = pd.DataFrame({&quot;Holidays&quot;: [datetime(2022,1,1), datetime(2022,1,6), datetime(2022,1,13), ....]})

I want to know for every event how many holidays are there in between the start and end date both inclusive. My solution:

df[&#39;holiday_count&#39;] = df.apply(lambda x: len(set(pd.date_range(x[&#39;Start&#39;], x[&#39;End&#39;])).intersection(set(holidays[&#39;Holidays&#39;]))), axis=1)

I realize that my solution is quite inefficient for large dataset of df1. Here are a few things which I tried:

  1. Since, it is not an exact match, df1.merge wouldn't help.
  2. I tried using pd.merge_asof, however, the joins count only to 1. Over here, the start and end period may contain multiple holidays or no holidays as well.
  3. I tried using pd.IntervalIndex. The issue over there I faced is KeyError for those ranges where there were no holidays.
  4. cross merge followed by filter is one option, but I think, it'd have a high memory imprint which I want to avoid.
  5. Although didn't try, but people were suggesting to use pandas_sql. However, there were comments stating it is slow method.

These trials were based on several stackoverflow questions in the past like:

  1. https://stackoverflow.com/questions/44367672/best-way-to-join-merge-by-range-in-pandas
  2. https://stackoverflow.com/questions/46179362/fastest-way-to-merge-pandas-dataframe-on-ranges
  3. https://stackoverflow.com/questions/30627968/merge-pandas-dataframes-where-one-value-is-between-two-others
  4. https://stackoverflow.com/questions/46525786/how-to-join-two-dataframes-for-which-column-values-are-within-a-certain-range

答案1

得分: 3

你可以尝试这个 [tag:numpy] 方法

sdates, edates = df1["Start"].values, df1["End"].values
hdates = df2["Holidays"].values[:, None]

df1["holiday_count"] = np.sum((hdates >= sdates) & (hdates <= end_dates), axis=0)

# 801 微秒 ± 45.7 微秒每次循环(7 次运行的平均值 ± 标准差,1,000 次循环每次)

输出:

print(df1)

  Event      Start        End  holiday_count
0    S1 2022-01-04 2022-01-19              2
1    K1 2022-01-15 2022-01-29              0
2    S2 2022-09-12 2022-09-27              0
3    S3 2022-11-11 2022-11-22              0
4    A1 2022-05-29 2022-06-15              0
英文:

You can try this [tag:numpy] approach :

sdates, edates = df1[&quot;Start&quot;].values, df1[&quot;End&quot;].values
hdates = df2[&quot;Holidays&quot;].values[:, None]

df1[&quot;holiday_count&quot;] = np.sum((hdates &gt;= sdates) &amp; (hdates &lt;= end_dates), axis=0)

# 801 &#181;s &#177; 45.7 &#181;s per loop (mean &#177; std. dev. of 7 runs, 1,000 loops each)

Output :

print(df1)

  Event      Start        End  holiday_count
0    S1 2022-01-04 2022-01-19              2
1    K1 2022-01-15 2022-01-29              0
2    S2 2022-09-12 2022-09-27              0
3    S3 2022-11-11 2022-11-22              0
4    A1 2022-05-29 2022-06-15              0

答案2

得分: 0

我建议这是你问题的一个替代解决方案

# 计算两个日期之间(包括起始日期和结束日期)的假期数量的函数
def count_holidays(start_date, end_date, holidays):
    return sum(start_date <= holiday <= end_date for holiday in holidays)

# 在"Event"上合并df1和df2
merged_df = pd.merge(df1, df2, left_on="Event", right_on="Event", how="left")

# 为每个事件计算假期数量
merged_df["HolidaysCount"] = merged_df.apply(lambda row: count_holidays(row["Start"], row["End"], row["Holidays"]), axis=1)

# 删除"Holidays"列
merged_df.drop("Holidays", axis=1, inplace=True)

print(merged_df)
英文:

I would also suggest this is an alternative solution to your question

 # Function to count the number of holidays between two dates (inclusive)
def count_holidays(start_date, end_date, holidays):
    return sum(start_date &lt;= holiday &lt;= end_date for holiday in holidays)

# Merge df1 and df2 on Event
merged_df = pd.merge(df1, df2, left_on=&quot;Event&quot;, right_on=&quot;Event&quot;, how=&quot;left&quot;)

# Count holidays for each event
merged_df[&quot;HolidaysCount&quot;] = merged_df.apply(lambda row: count_holidays(row[&quot;Start&quot;], row[&quot;End&quot;], row[&quot;Holidays&quot;]), axis=1)

# Drop the Holidays column
merged_df.drop(&quot;Holidays&quot;, axis=1, inplace=True)

print(merged_df)

答案3

得分: 0

这是一个不等式连接,可以使用conditional_join来高效解决:

# pip install pyjanitor
# 为了更好的性能,如果可能的话,
# 安装开发版本:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

(df1
.conditional_join(
    df2, 
    ('Start', 'Holidays', '<='), 
    ('End', 'Holidays', '>='), 
    how='left')
.groupby(df1.columns.tolist(), sort=False)
.count()
)
                             Holidays
Event Start      End                 
S1    2022-01-04 2022-01-19         2
K1    2022-01-15 2022-01-29         0
S2    2022-09-12 2022-09-27         0
S3    2022-11-11 2022-11-22         0
A1    2022-05-29 2022-06-15         0

在内部,它使用二分搜索而不是笛卡尔连接。对于大型数据,这提供了更高的性能/效率。

英文:

This is an inequality join, which is solved efficiently with conditional_join:

# pip install pyjanitor
# for better performance, if you can
# install the dev version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

(df1
.conditional_join(
    df2, 
    (&#39;Start&#39;, &#39;Holidays&#39;, &#39;&lt;=&#39;), 
    (&#39;End&#39;, &#39;Holidays&#39;, &#39;&gt;=&#39;), 
    how = &#39;left&#39;)
.groupby(df1.columns.tolist(), sort = False)
.count()
)
                             Holidays
Event Start      End                 
S1    2022-01-04 2022-01-19         2
K1    2022-01-15 2022-01-29         0
S2    2022-09-12 2022-09-27         0
S3    2022-11-11 2022-11-22         0
A1    2022-05-29 2022-06-15         0

Under the hood, it uses binary search, instead of a cartesian join. For large data, this offers more performance/efficiency.

huangapple
  • 本文由 发表于 2023年6月12日 18:15:39
  • 转载请务必保留本文链接:https://go.coder-hub.com/76455640.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定