Python: How to split a start date and end date in a DataFrame into the days falling in each month, creating a new row for every split


Question


I have the DataFrame shown below, with the fields ID, startDate and endDate.
If we take ID 1, the start and end dates span 3 months, from 15 February to 17 April, that is, months 2, 3 and 4. I would like to split the range into 3 parts, creating a separate row for every month, as shown in the required output below. In total I have approximately 30 million unique IDs.

Data -

```
ID  startDate   endDate
1   15/02/2023  17/04/2023
2   10/04/2023  20/06/2023
```

Required Output -

```
ID  startDate   endDate
1   15/02/2023  28/02/2023
1   01/03/2023  31/03/2023
1   01/04/2023  17/04/2023
2   10/04/2023  30/04/2023
2   01/05/2023  31/05/2023
2   01/06/2023  20/06/2023
```

Any help would be very much appreciated.
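
To make the example reproducible, the sample data can be built as a small DataFrame. This is only a sketch: it assumes the dates arrive as `dd/mm/yyyy` strings, as displayed above, and `n_rows` is a made-up helper name. It also works out how many output rows each ID should produce (the number of calendar months its range touches), which gives a rough idea of the output size before processing all IDs.

```python
import pandas as pd

# Sample data from the question; the dates are assumed to be dd/mm/yyyy strings.
df = pd.DataFrame(
    {
        "ID": [1, 2],
        "startDate": ["15/02/2023", "10/04/2023"],
        "endDate": ["17/04/2023", "20/06/2023"],
    }
)

# Parse with an explicit format so day and month cannot be swapped.
df["startDate"] = pd.to_datetime(df["startDate"], format="%d/%m/%Y")
df["endDate"] = pd.to_datetime(df["endDate"], format="%d/%m/%Y")

# Calendar months touched by each range, counted inclusively.
# n_rows is a hypothetical helper, not part of the required output.
n_rows = (
    (df["endDate"].dt.year - df["startDate"].dt.year) * 12
    + (df["endDate"].dt.month - df["startDate"].dt.month)
    + 1
)

print(n_rows.tolist())  # rows expected per ID
print(int(n_rows.sum()))  # total rows in the split output
```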



# Answer 1
**Score**: 0

I haven't been able to find a good way to vectorize this operation, so I'm not sure how it will perform across 30 million rows.

```python
import numpy as np
import pandas as pd


def split_dates(group):
    # Each ID is assumed to appear once, so the group holds a single row.
    row = group.iloc[0]
    # Whole months between the start month and the end month.
    months = int(np.datetime64(row.endDate, "M") - np.datetime64(row.startDate, "M"))

    start = np.datetime64(row.startDate, "D")
    end = np.datetime64(row.endDate, "D")

    # Month-end dates that close every segment except the last one.
    ends = np.array(
        [start + i * pd.offsets.MonthEnd() for i in range(1, months + 1)],
        dtype="datetime64[D]",
    )
    # Each following segment starts the day after the previous one ends.
    starts = ends + np.timedelta64(1, "D")

    # The first segment starts at the original start, the last ends at the original end.
    starts = np.insert(starts, 0, start)
    ends = np.append(ends, end)

    return pd.DataFrame({"startDate": starts, "endDate": ends})


# Assumes startDate and endDate already hold datetime values.
df.groupby("ID").apply(split_dates).reset_index().drop(columns="level_1")
```
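
On the vectorization concern, one possible sketch is to repeat each row once per month it touches and then clip the month boundaries to the original range with NumPy. The function name `split_by_month` is made up for illustration. The code assumes `startDate` and `endDate` are already datetime columns with no missing values and that `endDate >= startDate` on every row. It has not been benchmarked on 30 million rows.

```python
import numpy as np
import pandas as pd


def split_by_month(df):
    """Return one row per (ID, calendar month) covered by each date range."""
    s, e = df["startDate"], df["endDate"]

    # Calendar months touched by each range, counted inclusively.
    n = ((e.dt.year - s.dt.year) * 12 + (e.dt.month - s.dt.month) + 1).to_numpy()

    # Position of the source row for every output row.
    idx = np.repeat(np.arange(len(df)), n)
    # 0, 1, 2, ... counter that restarts for each source row.
    offset = np.arange(n.sum()) - np.repeat(np.cumsum(n) - n, n)

    # Month of every output row as datetime64[M].
    first_month = s.to_numpy().astype("datetime64[M]")
    month = first_month[idx] + offset.astype("timedelta64[M]")

    # First and last day of each of those months.
    month_start = month.astype("datetime64[D]")
    month_end = (month + np.timedelta64(1, "M")).astype("datetime64[D]") - np.timedelta64(1, "D")

    # Clip the first and last segment to the original range.
    orig_start = s.to_numpy().astype("datetime64[D]")[idx]
    orig_end = e.to_numpy().astype("datetime64[D]")[idx]

    return pd.DataFrame(
        {
            "ID": df["ID"].to_numpy()[idx],
            "startDate": np.maximum(month_start, orig_start),
            "endDate": np.minimum(month_end, orig_end),
        }
    )


# Usage with the sample data from the question.
sample = pd.DataFrame(
    {
        "ID": [1, 2],
        "startDate": pd.to_datetime(["15/02/2023", "10/04/2023"], format="%d/%m/%Y"),
        "endDate": pd.to_datetime(["17/04/2023", "20/06/2023"], format="%d/%m/%Y"),
    }
)
result = split_by_month(sample)
print(result)
```

Because every step works on whole arrays, there is no per-ID Python call, which is usually where `groupby(...).apply(...)` spends most of its time.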

# Answer 2
**Score**: 0

Just making a few changes to @Michael's code to handle the dates in case the column dtype is object.

```python
import numpy as np
import pandas as pd


def split_dates(group):
    row = group.iloc[0]
    # Whole months between the start month and the end month.
    months = int(np.datetime64(row.endDate, "M") - np.datetime64(row.startDate, "M"))

    start = np.datetime64(row.startDate, "D")
    end = np.datetime64(row.endDate, "D")

    # Convert to a pandas Timestamp, then step forward i month ends.
    ends = np.array(
        [pd.to_datetime(start) + i * pd.offsets.MonthEnd() for i in range(1, months + 1)],
        dtype="datetime64[D]",
    )
    starts = ends + np.timedelta64(1, "D")

    starts = np.insert(starts, 0, start)
    ends = np.append(ends, end)

    return pd.DataFrame({"startDate": starts, "endDate": ends})


df.groupby("ID").apply(split_dates).reset_index().drop(columns="level_1")
```
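
If the object columns hold `dd/mm/yyyy` strings as shown in the question, another option is to convert them once up front and format them back at the end, rather than converting inside the per-group function. The sketch below assumes that string format. The helper names `parse_date_columns` and `format_date_columns` are hypothetical.

```python
import pandas as pd

DATE_COLUMNS = ("startDate", "endDate")
DATE_FORMAT = "%d/%m/%Y"  # assumed from the sample data in the question


def parse_date_columns(frame, columns=DATE_COLUMNS, fmt=DATE_FORMAT):
    """Return a copy with the given string columns converted to datetime64."""
    out = frame.copy()
    for col in columns:
        out[col] = pd.to_datetime(out[col], format=fmt)
    return out


def format_date_columns(frame, columns=DATE_COLUMNS, fmt=DATE_FORMAT):
    """Return a copy with the given datetime columns rendered back to strings."""
    out = frame.copy()
    for col in columns:
        out[col] = out[col].dt.strftime(fmt)
    return out


# Usage: parse before splitting, format after splitting.
raw = pd.DataFrame(
    {"ID": [1], "startDate": ["15/02/2023"], "endDate": ["17/04/2023"]}
)
parsed = parse_date_columns(raw)
round_trip = format_date_columns(parsed)
```

Converting whole columns in one call is likely to be cheaper than converting row by row, and it keeps the splitting function free of string handling.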

huangapple
  • Posted on 12 July 2023 at 22:21:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76671597.html