July 12, 2023 22:21:18

Python: How to split the given start date and end date in a dataframe into the number of days falling in each month, creating a new row for every date split

# Question

I have the dataframe shown below, with fields such as ID, start date, and end date.
If we take ID 1, the start and end dates span 3 months, from 15 February to 17 April, i.e. months 2, 3, and 4. I would like to split the range into 3 parts, creating a separate row for each month, as shown in the required output below. In total I have approximately 30 million unique IDs.

Data -

```
ID  startDate   endDate
1   15/02/2023  17/04/2023
2   10/04/2023  20/06/2023
```

Required Output -

```
ID  startDate   endDate
1   15/02/2023  28/02/2023
1   01/03/2023  31/03/2023
1   01/04/2023  17/04/2023
2   10/04/2023  30/04/2023
2   01/05/2023  31/05/2023
2   01/06/2023  20/06/2023
```

Any help would be very much appreciated.
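
For anyone who wants to reproduce the problem, the sample rows can be loaded as follows. This is a minimal sketch that assumes the dates arrive as DD/MM/YYYY strings, as displayed above; the variable name `df` is simply the one the answers below rely on.

```python
import pandas as pd

# Sample rows from the question; dates are assumed to be DD/MM/YYYY strings.
df = pd.DataFrame(
    {
        "ID": [1, 2],
        "startDate": ["15/02/2023", "10/04/2023"],
        "endDate": ["17/04/2023", "20/06/2023"],
    }
)

# An explicit format keeps 10/04/2023 from being read as 4 October.
for col in ["startDate", "endDate"]:
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")

print(df.dtypes)
```

With the columns parsed into datetimes, month arithmetic no longer depends on how the strings happen to be written.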
# Answer 1
**Score**: 0

I haven't been able to find a good way to vectorize this operation, so I'm not sure how it will perform across 30 million rows.

```python
import pandas as pd

# `df` is the question's dataframe; parse the DD/MM/YYYY strings first.
df["startDate"] = pd.to_datetime(df["startDate"], format="%d/%m/%Y")
df["endDate"] = pd.to_datetime(df["endDate"], format="%d/%m/%Y")


def split_dates(group):
    row = group.iloc[0]  # IDs are unique, so each group holds a single row
    start, end = row.startDate, row.endDate
    # Number of month boundaries crossed between start and end.
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # Last day of every month except the final one. MonthEnd(0) leaves a
    # start date that already falls on a month end where it is.
    first_end = start + pd.offsets.MonthEnd(0)
    ends = [first_end + pd.offsets.MonthEnd(i) for i in range(months)]
    # Every later piece starts the day after the previous month end.
    starts = [start] + [e + pd.Timedelta(days=1) for e in ends]
    ends.append(end)
    new_df = pd.DataFrame({"startDate": starts, "endDate": ends})
    return new_df


df.groupby("ID").apply(split_dates).reset_index().drop(columns="level_1")
```
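
Since the answer leaves vectorization open, here is one possible sketch that avoids the per-ID `apply` by repeating each row once per calendar month it touches and clipping the month boundaries with NumPy. It is an illustration, not a benchmarked solution: the function name `split_by_month` is hypothetical, and it assumes both columns are already datetimes and that `startDate <= endDate` in every row.

```python
import numpy as np
import pandas as pd


def split_by_month(df):
    """Return one row per (ID, calendar month) touched by each date range."""
    start = df["startDate"].to_numpy().astype("datetime64[D]")
    end = df["endDate"].to_numpy().astype("datetime64[D]")
    first_month = start.astype("datetime64[M]")
    # Number of calendar months each range touches (at least 1).
    n_months = (end.astype("datetime64[M]") - first_month).astype(np.int64) + 1
    # Repeat every input row once per month it touches.
    row = np.repeat(np.arange(len(df)), n_months)
    # Position of each output row inside its own range: 0, 1, ..., n - 1.
    group_start = np.cumsum(n_months) - n_months
    k = np.arange(n_months.sum()) - np.repeat(group_start, n_months)
    month = first_month[row] + k.astype("timedelta64[M]")
    month_first = month.astype("datetime64[D]")
    month_last = (month + np.timedelta64(1, "M")).astype("datetime64[D]")
    month_last = month_last - np.timedelta64(1, "D")
    # Clip each calendar month to the original start and end dates.
    return pd.DataFrame(
        {
            "ID": df["ID"].to_numpy()[row],
            "startDate": pd.to_datetime(np.maximum(month_first, start[row])),
            "endDate": pd.to_datetime(np.minimum(month_last, end[row])),
        }
    )


df = pd.DataFrame(
    {
        "ID": [1, 2],
        "startDate": pd.to_datetime(["15/02/2023", "10/04/2023"], format="%d/%m/%Y"),
        "endDate": pd.to_datetime(["17/04/2023", "20/06/2023"], format="%d/%m/%Y"),
    }
)
out = split_by_month(df)
print(out)
```

Everything here is array arithmetic, so the cost should grow with the number of output rows rather than with the number of Python-level function calls, though memory use for 30 million IDs would still need checking.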

# Answer 2
**Score**: 0

Just making a few changes to @Michaels's code to handle the date datatype in case it is an object.

```python
import pandas as pd

# Format of the date strings in the question's sample data; adjust as needed.
DATE_FORMAT = "%d/%m/%Y"


def split_dates(group):
    row = group.iloc[0]  # IDs are unique, so each group holds a single row
    # Parse here so object (string) columns work; Timestamps pass through.
    start = pd.to_datetime(row.startDate, format=DATE_FORMAT)
    end = pd.to_datetime(row.endDate, format=DATE_FORMAT)
    # Number of month boundaries crossed between start and end.
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # Last day of every month except the final one.
    first_end = start + pd.offsets.MonthEnd(0)
    ends = [first_end + pd.offsets.MonthEnd(i) for i in range(months)]
    # Every later piece starts the day after the previous month end.
    starts = [start] + [e + pd.Timedelta(days=1) for e in ends]
    ends.append(end)
    new_df = pd.DataFrame({"startDate": starts, "endDate": ends})
    return new_df


df.groupby("ID").apply(split_dates).reset_index().drop(columns="level_1")
```
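
The title also asks for the number of days falling in each month. Once the ranges are split, that count is a subtraction per row. The sketch below starts from the question's required output, typed in by hand; the `days` and `month` column names are assumptions added for illustration.

```python
import pandas as pd

# The required output from the question, entered by hand.
split = pd.DataFrame(
    {
        "ID": [1, 1, 1, 2, 2, 2],
        "startDate": ["15/02/2023", "01/03/2023", "01/04/2023",
                      "10/04/2023", "01/05/2023", "01/06/2023"],
        "endDate": ["28/02/2023", "31/03/2023", "17/04/2023",
                    "30/04/2023", "31/05/2023", "20/06/2023"],
    }
)
for col in ["startDate", "endDate"]:
    split[col] = pd.to_datetime(split[col], format="%d/%m/%Y")

# Both endpoints are inclusive, hence the + 1.
split["days"] = (split["endDate"] - split["startDate"]).dt.days + 1
split["month"] = split["startDate"].dt.to_period("M")
print(split[["ID", "month", "days"]])
```

For ID 1 this gives 14, 31, and 17 days, which sum to the 62 days between 15 February and 17 April inclusive.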
