Breakdown of time period (duration) to month/year

Posted: 2023-06-13 14:57:02

Question

My initial dataframe looks as follows:

ID	Start	End	DurationDays
1	2023-05-20 12:00:00.000	2023-06-03 12:00:00.000	14
2	2023-05-20 12:00:00.000	2023-05-23 12:00:00.000	3
1	2023-06-01 12:00:00.000	2023-06-03 12:00:00.000	2

How can I get from this to a dataframe like the following? The problem is that there are quite a lot of rows (millions), so performance is important.

ID	Year	Month	DurationDays
1	2023	May	11
2	2023	May	3
1	2023	June	5

Update: Please note that a single row can span more than one month boundary. Example: Start 2023-02-20 (February) and End 2023-12-18 (December).
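For anyone who wants to run the answers below, the sample rows can be rebuilt as follows. This is only a convenience sketch: the column names and values are copied from the table above, and DurationDays is recomputed as End minus Start in whole days.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "ID": [1, 2, 1],
    "Start": ["2023-05-20 12:00:00.000", "2023-05-20 12:00:00.000", "2023-06-01 12:00:00.000"],
    "End": ["2023-06-03 12:00:00.000", "2023-05-23 12:00:00.000", "2023-06-03 12:00:00.000"],
})

# Parse both columns as datetimes, then derive the duration in whole days.
df[["Start", "End"]] = df[["Start", "End"]].apply(pd.to_datetime)
df["DurationDays"] = (df["End"] - df["Start"]).dt.days

print(df["DurationDays"].tolist())  # [14, 3, 2]
```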

Answer 1

Score: 1

A first idea is to repeat each row once per day between Start and End, and then count the rows per ID/Year/Month with GroupBy.size:

import pandas as pd

df[['Start', 'End']] = df[['Start', 'End']].apply(pd.to_datetime)
# Repeat each row once per whole day between Start and End
df = df.loc[df.index.repeat(df['End'].sub(df['Start']).dt.days)]
# Day offset 0, 1, 2, ... within each original row
s = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='D')
df['Date'] = df['Start'].add(s)
# Count the generated days per ID/Year/Month
df = (df.groupby(['ID', df['Date'].dt.year.rename('Year'),
                  df['Date'].dt.month_name().rename('Month')], sort=False)
        .size()
        .reset_index(name='DurationDays'))
print(df)
   ID  Year Month  DurationDays
0   1  2023   May            12
1   1  2023  June             4
2   2  2023   May             3

For better precision, it is possible to use hours instead of days:

# Starting again from the original df
df[['Start', 'End']] = df[['Start', 'End']].apply(pd.to_datetime)
# Repeat each row once per whole hour between Start and End
hours = df['End'].sub(df['Start']).dt.total_seconds().floordiv(3600).astype(int)
df = df.loc[df.index.repeat(hours)]
# Hour offset 0, 1, 2, ... within each original row
s = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='h')
df['Start'] = df['Start'].add(s)
df = (df.assign(Year=df['Start'].dt.year,
                Month=df['Start'].dt.month_name())
          .groupby(['ID', 'Year', 'Month'], sort=False)
          .size()
          .div(24)  # convert the hour counts back to days
          .reset_index(name='DurationDays'))
print(df)
   ID  Year Month  DurationDays
0   1  2023   May          11.5
1   1  2023  June           4.5
2   2  2023   May           3.0
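Both variants create one intermediate row per day (or per hour) of every interval, so with millions of input rows it may be worth estimating the size of that intermediate first. A rough sketch, assuming the same Start/End columns as above; the helper name expanded_rows is made up for this illustration:

```python
import pandas as pd

def expanded_rows(df: pd.DataFrame, unit: str = "D") -> int:
    """Number of rows a repeat-based expansion would create for the given unit."""
    span = pd.to_datetime(df["End"]) - pd.to_datetime(df["Start"])
    # Whole units per interval, summed over all rows.
    return int((span // pd.Timedelta(1, unit=unit)).sum())

# Sample rows from the question.
df = pd.DataFrame({
    "ID": [1, 2, 1],
    "Start": ["2023-05-20 12:00:00", "2023-05-20 12:00:00", "2023-06-01 12:00:00"],
    "End": ["2023-06-03 12:00:00", "2023-05-23 12:00:00", "2023-06-03 12:00:00"],
})

print(expanded_rows(df, "D"))  # 19
print(expanded_rows(df, "h"))  # 456
```

On the three sample rows the hourly variant already needs 24 times as many intermediate rows as the daily one.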

Answer 2

Score: 1

@jezrael's approach is on the right track, but it is probably unrealistic for a large DataFrame because it repeats every row once per day or per hour, creating an intermediate with tens or hundreds of millions of rows.

Instead, it seems more reasonable to use a pandas Period as the unit for repeating rows.

In short: repeat each row once per period it touches (repeat), number the copies with groupby.cumcount, replace the inner Start/End values with the period boundaries (start_time), and finally add up the durations per period with groupby.sum.

I deliberately added several intermediate columns to make the code easier to follow; if needed, many steps can be combined to avoid creating them.

import pandas as pd

df[['Start', 'End']] = df[['Start', 'End']].apply(pd.to_datetime)
period = 'M'
out = (df.assign(start=df['Start'].dt.to_period(period),
                 end=df['End'].dt.to_period(period),
                 # number of period boundaries crossed by each row
                 diff=lambda d: d['end'].sub(d['start']).apply(lambda x: x.n))
         # one copy of the row per period it touches
         .loc[lambda d: d.index.repeat(d['diff']+1)]
         .assign(n=lambda d: d.groupby(level=0).cumcount(),
                 # every copy but the first starts at the beginning of its own period
                 Start=lambda d: d['Start'].mask(d.index.duplicated(), d['start'].add(d['n']).dt.start_time),
                 # every copy but the last ends at the beginning of the next period
                 End=lambda d: d['End'].mask(d.index.duplicated(keep='last'), d['start'].add(d['n']+1).dt.start_time),
                 DurationDays=lambda d: d['End'].sub(d['Start']),
                 Year=lambda d: d['Start'].dt.year,
                 Month=lambda d: d['Start'].dt.month_name(),
                )
        .groupby(['ID', 'Year', 'Month'], as_index=False, sort=False)
       ['DurationDays'].sum()
      )

Output:

   ID  Year Month     DurationDays
0   1  2023   May 11 days 12:00:00
1   1  2023  June  4 days 12:00:00
2   2  2023   May  3 days 00:00:00

Intermediate before the groupby.sum:

   ID               Start                 End     DurationDays    start      end  diff  n  Year Month
0   1 2023-05-20 12:00:00 2023-06-01 00:00:00 11 days 12:00:00  2023-05  2023-06     1  0  2023   May
0   1 2023-06-01 00:00:00 2023-06-03 12:00:00  2 days 12:00:00  2023-05  2023-06     1  1  2023  June
1   2 2023-05-20 12:00:00 2023-05-23 12:00:00  3 days 00:00:00  2023-05  2023-05     0  0  2023   May
2   1 2023-06-01 12:00:00 2023-06-03 12:00:00  2 days 00:00:00  2023-06  2023-06     0  0  2023  June
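The question's update mentions rows that span many months (2023-02-20 to 2023-12-18). One possible way to sanity-check that case is the sketch below. It is not the answer's exact code: it follows the same period-based idea but clips each copy to its month instead of masking duplicated index entries. The function name split_by_month, the output column Duration, and the ID value 7 are made up for this illustration, and the input is assumed to have a unique index and datetime Start/End columns.

```python
import pandas as pd

def split_by_month(df: pd.DataFrame) -> pd.DataFrame:
    """Split each [Start, End) interval at month boundaries, then sum per ID/Year/Month."""
    # Number of calendar months each row touches.
    n_months = ((df["End"].dt.year - df["Start"].dt.year) * 12
                + (df["End"].dt.month - df["Start"].dt.month) + 1)

    # One copy of each row per month it touches.
    pieces = df.loc[df.index.repeat(n_months.to_numpy()), ["ID", "Start", "End"]]
    k = pieces.groupby(level=0).cumcount().to_numpy()  # 0, 1, 2, ... within each row
    pieces = pieces.reset_index(drop=True)

    # Month covered by each copy, with its lower and upper boundary.
    month = pieces["Start"].dt.to_period("M") + k
    lo = month.dt.start_time
    hi = (month + 1).dt.start_time

    # Clip the original interval to the month of each copy.
    seg_start = pieces["Start"].where(pieces["Start"] > lo, lo)
    seg_end = pieces["End"].where(pieces["End"] < hi, hi)

    return (pieces.assign(Year=month.dt.year,
                          Month=lo.dt.month_name(),
                          Duration=seg_end - seg_start)
                  .groupby(["ID", "Year", "Month"], sort=False, as_index=False)["Duration"]
                  .sum())

multi = pd.DataFrame({
    "ID": [7],
    "Start": pd.to_datetime(["2023-02-20"]),
    "End": pd.to_datetime(["2023-12-18"]),
})
res = split_by_month(multi)
print(res)  # 11 rows: February (9 days) through December (17 days)
```

The per-month durations add up to the full 301 days of the interval, which is a quick check that no piece is lost or counted twice.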

Generalization

For a more generic method that works with any period (provided you have enough resources):

import pandas as pd

df[['Start', 'End']] = df[['Start', 'End']].apply(pd.to_datetime)
period = 'Y'
out = (df.assign(Period=df['Start'].dt.to_period(period),
                 end=df['End'].dt.to_period(period),
                 diff=lambda d: d['end'].sub(d['Period']).apply(lambda x: x.n))
         .loc[lambda d: d.index.repeat(d['diff']+1)]
         .assign(n=lambda d: d.groupby(level=0).cumcount(),
                 # period covered by each copy of the row
                 Period=lambda d: d['Period'].add(d['n']),
                 Start=lambda d: d['Start'].mask(d.index.duplicated(), d['Period'].dt.start_time),
                 End=lambda d: d['End'].mask(d.index.duplicated(keep='last'), d['Period'].add(1).dt.start_time),
                 DurationDays=lambda d: d['End'].sub(d['Start']),
                )
        .groupby(['ID', 'Period'], as_index=False, sort=False)
       ['DurationDays'].sum()
      )

Output for period = 'Y':

   ID Period DurationDays
0   1   2023      16 days
1   2   2023       3 days

Output for period = 'M':

   ID   Period     DurationDays
0   1  2023-05 11 days 12:00:00
1   1  2023-06  4 days 12:00:00
2   2  2023-05  3 days 00:00:00

Answer 3

Score: 1

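You can reshape your DataFrame with pd.lreshape and then apply split_months (slightly adjusted) per ID. Note that split_months is a separate helper whose definition is not included in this post: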

tmp = (
    pd.lreshape(df, {'Datetime': ['Start', 'End']})
        .sort_values(by=['ID', 'Datetime'])
)
out = (
    tmp.groupby(['ID']).apply(
        lambda g: split_months(g['Datetime'].min(), g['Datetime'].max())
    ).reset_index('ID')
)

Output:

print(out)
   ID Month  Year  DurationDays
1   1   May  2023         11.00
2   1  June  2023          3.00 # <-- needs to be fixed (should be `2+3`)
1   2   May  2023          3.00
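Since the helper itself is not shown, here is a hypothetical stand-in for experimenting with this approach. Everything about it is an assumption: the signature (a start and an end timestamp) and the output columns Month, Year and DurationDays are inferred from the call and the printed result above, and the body is not the original helper. It keeps fractional days, so it will not reproduce the rounded figures printed above.

```python
import pandas as pd

def split_months(start: pd.Timestamp, end: pd.Timestamp) -> pd.DataFrame:
    """Hypothetical stand-in: split one [start, end) interval by calendar month."""
    months = pd.period_range(start, end, freq="M")
    lo = pd.Series(months.start_time)        # first instant of each month
    hi = pd.Series((months + 1).start_time)  # first instant of the following month

    # Clip the month boundaries to the interval itself.
    seg_start = lo.where(lo > start, start)
    seg_end = hi.where(hi < end, end)

    days = (seg_end - seg_start) / pd.Timedelta(days=1)
    return pd.DataFrame({
        "Month": lo.dt.month_name(),
        "Year": lo.dt.year,
        "DurationDays": days,
    })

parts = split_months(pd.Timestamp("2023-05-20 12:00"), pd.Timestamp("2023-06-03 12:00"))
print(parts["Month"].tolist())         # ['May', 'June']
print(parts["DurationDays"].tolist())  # [11.5, 2.5]
```

Calling such a helper once per ID on the minimum and maximum timestamps merges all intervals of that ID into one, which is probably why the June total above is flagged. Calling it per row and summing the results would avoid that, at the cost of a Python-level loop over millions of rows.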
