Python: How to split a start date and end date in a DataFrame into the days falling in each month, creating a new row for every split
# Question
I have the DataFrame shown below, which has the fields ID, startDate and endDate.
If we take ID 1, the start and end dates span 3 months, starting on 15th February and ending on 17th April, i.e. months 2, 3 and 4. I would like to split that range into 3 parts, creating a separate row for every month, as shown in the required output below. In total I have approximately 30 million unique IDs.
Data -
```
ID startDate endDate
1 15/02/2023 17/04/2023
2 10/04/2023 20/06/2023
```
Required Output -
```
ID startDate endDate
1 15/02/2023 28/02/2023
1 01/03/2023 31/03/2023
1 01/04/2023 17/04/2023
2 10/04/2023 30/04/2023
2 01/05/2023 31/05/2023
2 01/06/2023 20/06/2023
```
Any help would be very much appreciated.
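For anyone who wants to reproduce the problem, the sample can be built as below. This is only a setup sketch: it assumes the dates arrive as day-first strings, as displayed in the question, and parses them with an explicit format so that a value such as 10/04/2023 is read as 10 April rather than 4 October.

```python
import pandas as pd

# Sample rows copied from the question; the dates are day-first strings.
df = pd.DataFrame(
    {
        "ID": [1, 2],
        "startDate": ["15/02/2023", "10/04/2023"],
        "endDate": ["17/04/2023", "20/06/2023"],
    }
)

# An explicit format removes any day/month ambiguity during parsing.
for col in ["startDate", "endDate"]:
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")

print(df.dtypes)
```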
</details>
# Answer 1
**Score**: 0
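I haven't been able to find a good way to vectorize this operation, so I'm not sure how this will perform across 30 million rows.
```python
import pandas as pd


def split_dates(group):
    # Each ID has a single row. startDate and endDate must already be
    # datetimes, not strings.
    row = group.iloc[0]
    start, end = row.startDate, row.endDate
    # Number of month boundaries crossed between start and end.
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # MonthEnd(0) rolls forward to the end of the start month and leaves a
    # start that already falls on a month end unchanged.
    first_end = start + pd.offsets.MonthEnd(0)
    ends = [first_end + pd.offsets.MonthEnd(i) for i in range(months)]
    # Every later piece starts the day after the previous piece ends.
    starts = [start] + [e + pd.Timedelta(days=1) for e in ends]
    ends.append(end)
    return pd.DataFrame({"startDate": starts, "endDate": ends})


result = (
    df.groupby("ID")[["startDate", "endDate"]]
    .apply(split_dates)
    .reset_index()
    .drop(columns="level_1")
)
```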
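A vectorized variant does seem possible. The following is a minimal sketch that is not part of the original answer and has not been benchmarked on 30 million rows. It assumes `startDate` and `endDate` are already datetime columns and that every end date is on or after its start date; `split_by_month` is a hypothetical helper name. The idea is to repeat each row once per calendar month it touches, then clip each month's first and last day to the original range.

```python
import numpy as np
import pandas as pd


def split_by_month(df):
    """Split every [startDate, endDate] range at month boundaries."""
    start = df["startDate"].to_numpy().astype("datetime64[D]")
    end = df["endDate"].to_numpy().astype("datetime64[D]")
    start_m = start.astype("datetime64[M]")
    end_m = end.astype("datetime64[M]")
    # Number of calendar months each range touches (assumes end >= start).
    n_months = (end_m - start_m).astype(np.int64) + 1
    # Row position repeated once per touched month, plus a 0, 1, 2, ... counter
    # that restarts for every original row.
    pos = np.repeat(np.arange(len(df)), n_months)
    k = np.arange(n_months.sum()) - np.repeat(n_months.cumsum() - n_months, n_months)
    # Calendar month of each piece, then that month's first and last day.
    month = start_m[pos] + k.astype("timedelta64[M]")
    month_first = month.astype("datetime64[D]")
    next_month_first = (month + np.timedelta64(1, "M")).astype("datetime64[D]")
    month_last = next_month_first - np.timedelta64(1, "D")
    out = df.iloc[pos].reset_index(drop=True)
    # Clip the first and last piece to the original range.
    out["startDate"] = np.maximum(month_first, start[pos]).astype("datetime64[ns]")
    out["endDate"] = np.minimum(month_last, end[pos]).astype("datetime64[ns]")
    return out


# Sample data from the question, parsed as day-first dates.
df = pd.DataFrame(
    {
        "ID": [1, 2],
        "startDate": pd.to_datetime(["15/02/2023", "10/04/2023"], format="%d/%m/%Y"),
        "endDate": pd.to_datetime(["17/04/2023", "20/06/2023"], format="%d/%m/%Y"),
    }
)
result = split_by_month(df)
print(result)
```

Because all of the work happens in a handful of NumPy array operations, there is no per-ID Python function call, which is usually the main cost of `groupby(...).apply(...)` on many small groups.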
# Answer 2
**Score**: 0
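Just making a few changes to @Michaels's code to handle the date columns in case their dtype is object.
```python
import pandas as pd


def split_dates(group):
    row = group.iloc[0]  # each ID has a single row
    # Convert object (string) dates to Timestamps; values that are already
    # Timestamps pass through unchanged. The format assumes ISO strings; use
    # "%d/%m/%Y" for day-first strings such as the ones in the question.
    start = pd.to_datetime(row.startDate, format="%Y-%m-%d")
    end = pd.to_datetime(row.endDate, format="%Y-%m-%d")
    # Number of month boundaries crossed between start and end.
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # Month ends for every month except the last one.
    first_end = start + pd.offsets.MonthEnd(0)
    ends = [first_end + pd.offsets.MonthEnd(i) for i in range(months)]
    # Every later piece starts the day after the previous piece ends.
    starts = [start] + [e + pd.Timedelta(days=1) for e in ends]
    ends.append(end)
    return pd.DataFrame({"startDate": starts, "endDate": ends})


result = (
    df.groupby("ID")[["startDate", "endDate"]]
    .apply(split_dates)
    .reset_index()
    .drop(columns="level_1")
)
```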
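Since the title asks for the number of days falling in each month, the split rows can be turned into day counts with one more step. This is a small sketch that is not taken from either answer. It assumes the split frame already holds datetime columns and that both endpoints are inclusive; the `days` and `month` column names are illustrative. The last row uses 20/06/2023, the end date given for ID 2 in the input data.

```python
import pandas as pd

# Split rows for the question's two IDs, already parsed to datetimes.
split = pd.DataFrame(
    {
        "ID": [1, 1, 1, 2, 2, 2],
        "startDate": pd.to_datetime(
            ["2023-02-15", "2023-03-01", "2023-04-01",
             "2023-04-10", "2023-05-01", "2023-06-01"]
        ),
        "endDate": pd.to_datetime(
            ["2023-02-28", "2023-03-31", "2023-04-17",
             "2023-04-30", "2023-05-31", "2023-06-20"]
        ),
    }
)

# Both endpoints are inclusive, hence the + 1.
split["days"] = (split["endDate"] - split["startDate"]).dt.days + 1
split["month"] = split["startDate"].dt.to_period("M")

# Sanity check: the pieces of each ID add up to the length of its full range.
total_days = split.groupby("ID")["days"].sum()
print(split)
print(total_days)
```

Comparing `total_days` with the length of each original range is a cheap way to confirm that no day was lost or counted twice at a month boundary.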