Python : How to split the given start date and end date in a dataframe into number of days falling in each month creating new row for every date split


Question


I have a dataframe, shown below, with the fields ID, startDate and endDate.
Taking ID 1 as an example, the start and end dates span 3 months, starting on 15 February and ending on 17 April, i.e. months 2, 3 and 4. I would like to split that range into 3 parts, creating a separate row for each month, as shown in the required output below. In total I have approximately 30 million unique IDs.

Data -

```
ID startDate  endDate
1  15/02/2023 17/04/2023
2  10/04/2023 20/06/2023
```

Required Output -

```
ID startDate  endDate
1  15/02/2023 28/02/2023
1  01/03/2023 31/03/2023
1  01/04/2023 17/04/2023
2  10/04/2023 30/04/2023
2  01/05/2023 31/05/2023
2  01/06/2023 20/06/2023
```

Any help would be very much appreciated.
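For anyone who wants to try the answers against the sample data, here is a minimal sketch that rebuilds it. It assumes the dates arrive as day-first strings (the question does not state the column dtypes); the name `df` simply matches what the answers use, and the explicit `format` is my assumption about the input.

```python
import pandas as pd

# Sample data from the question; dates are day-first strings (DD/MM/YYYY).
df = pd.DataFrame(
    {
        "ID": [1, 2],
        "startDate": ["15/02/2023", "10/04/2023"],
        "endDate": ["17/04/2023", "20/06/2023"],
    }
)

# Parse with an explicit format so 10/04/2023 is read as 10 April, not 4 October.
for col in ["startDate", "endDate"]:
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")

print(df.dtypes)
```

Parsing once up front keeps the per-row logic free of string handling, which probably matters at 30 million rows.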
Answer 1

Score: 0

I haven't been able to find a good way to vectorize this operation, so I'm not sure how it will perform across 30 million rows.

```python
import pandas as pd


def split_dates(group):
    # Assumes one row per ID and datetime64 startDate/endDate columns.
    row = group.iloc[0]
    start, end = row.startDate, row.endDate
    # Number of month boundaries crossed between start and end.
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # MonthEnd(0) rolls forward to the end of the start month
    # (and stays put if start is already a month end).
    first_end = start + pd.offsets.MonthEnd(0)
    ends = [first_end + pd.offsets.MonthEnd(i) for i in range(months)]
    # Every later segment starts the day after the previous month end.
    starts = [start] + [e + pd.Timedelta(days=1) for e in ends]
    ends.append(end)
    return pd.DataFrame({"startDate": starts, "endDate": ends})


df.groupby("ID").apply(split_dates).reset_index().drop(columns="level_1")
```
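One possible way to vectorize this, offered as a sketch rather than a benchmarked solution: repeat each row once per calendar month it touches, then clip each month's first and last day to the original range. The function name `split_by_month` and the sample frame are illustrative. It assumes both columns are already datetime64 and that endDate is never before startDate. I have not timed it on 30 million rows.

```python
import numpy as np
import pandas as pd


def split_by_month(df):
    """Split each [startDate, endDate] range into one row per calendar month."""
    start = df["startDate"].to_numpy().astype("datetime64[D]")
    end = df["endDate"].to_numpy().astype("datetime64[D]")
    start_month = start.astype("datetime64[M]")
    # Number of calendar months each range touches.
    n_months = (end.astype("datetime64[M]") - start_month).astype(np.int64) + 1

    # Repeat every input row once per month it touches.
    row_idx = np.repeat(np.arange(len(df)), n_months)
    # Position (0, 1, 2, ...) of each output row within its own range.
    first_pos = np.cumsum(n_months) - n_months
    offset = np.arange(n_months.sum()) - np.repeat(first_pos, n_months)

    month = start_month[row_idx] + offset.astype("timedelta64[M]")
    month_first = month.astype("datetime64[D]")
    month_last = (month + np.timedelta64(1, "M")).astype("datetime64[D]") - np.timedelta64(1, "D")

    return pd.DataFrame(
        {
            "ID": df["ID"].to_numpy()[row_idx],
            # Clip the month boundaries to the original range.
            "startDate": pd.to_datetime(np.maximum(month_first, start[row_idx])),
            "endDate": pd.to_datetime(np.minimum(month_last, end[row_idx])),
        }
    )


# Sample data from the question, already parsed as dates.
sample = pd.DataFrame(
    {
        "ID": [1, 2],
        "startDate": pd.to_datetime(["2023-02-15", "2023-04-10"]),
        "endDate": pd.to_datetime(["2023-04-17", "2023-06-20"]),
    }
)
result = split_by_month(sample)
print(result)
```

Because everything is done on whole arrays, there is no per-ID Python call, at the cost of materialising the full expanded frame in memory at once.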

Answer 2

Score: 0

Just making a few changes to @Michaels's code to handle the date columns in case their dtype is object.

```python
import pandas as pd


def split_dates(group):
    # Assumes one row per ID.
    row = group.iloc[0]
    # Convert explicitly so object-dtype values are handled as well as real
    # timestamps; day-first strings (as in the question) are assumed.
    start = pd.to_datetime(row.startDate, dayfirst=True)
    end = pd.to_datetime(row.endDate, dayfirst=True)
    # Number of month boundaries crossed between start and end.
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # MonthEnd(0) rolls forward to the end of the start month.
    first_end = start + pd.offsets.MonthEnd(0)
    ends = [first_end + pd.offsets.MonthEnd(i) for i in range(months)]
    starts = [start] + [e + pd.Timedelta(days=1) for e in ends]
    ends.append(end)
    return pd.DataFrame({"startDate": starts, "endDate": ends})


df.groupby("ID").apply(split_dates).reset_index().drop(columns="level_1")
```
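Since the title asks for the number of days falling in each month, here is a small follow-up sketch that derives it from the split rows. The `split` frame is written by hand from the required output for ID 1, and the `days` and `month` column names are my own, not something from the question.

```python
import pandas as pd

# Hypothetical already-split rows for ID 1, matching the required output.
split = pd.DataFrame(
    {
        "ID": [1, 1, 1],
        "startDate": pd.to_datetime(["2023-02-15", "2023-03-01", "2023-04-01"]),
        "endDate": pd.to_datetime(["2023-02-28", "2023-03-31", "2023-04-17"]),
    }
)

# Both endpoints are inclusive, hence the + 1.
split["days"] = (split["endDate"] - split["startDate"]).dt.days + 1
# Label each row with the calendar month it falls in.
split["month"] = split["startDate"].dt.to_period("M")

print(split)
```

Summing `days` per ID and comparing it with the length of the original range is a cheap sanity check that no day was dropped or counted twice.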

huangapple
  • Posted on 12 July 2023 at 22:21:18
  • Please keep this link when reposting: https://go.coder-hub.com/76671597.html