Breakdown of time period (duration) to month/year

Question

My initial dataframe looks as follows:

ID  Start                    End                      DurationDays
1   2023-05-20 12:00:00.000  2023-06-03 12:00:00.000  14
2   2023-05-20 12:00:00.000  2023-05-23 12:00:00.000  3
1   2023-06-01 12:00:00.000  2023-06-03 12:00:00.000  2

How can I get from this to a dataframe like the following? The problem is that there are quite a lot of rows (millions), so performance is quite important.

ID  Year  Month  DurationDays
1   2023  May    11
2   2023  May    3
1   2023  June   5

Update: Please note that a period may span more than one month boundary. Example: 2023-02-20 (February) to 2023-12-18 (December).

Answer 1

Score: 1

A first idea is to repeat each row by the difference between the End and Start columns in days, then count the number of rows per ID/Year/Month with GroupBy.size:

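import pandas as pd

df[['Start','End']] = df[['Start','End']].apply(pd.to_datetime)

# repeat each row once per whole day between Start and End (needs a unique index)
df = df.loc[df.index.repeat(df['End'].sub(df['Start']).dt.days)]

# day offset 0, 1, 2, ... within each original row
s = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
df['Date'] = df['Start'].add(s)

# count the generated days per ID/Year/Month
df = (df.groupby(['ID', df['Date'].dt.year.rename('Year'),
                        df['Date'].dt.month_name().rename('Month')], sort=False)
        .size()
        .reset_index(name='DurationDays'))
print(df)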
   ID  Year Month  DurationDays
0   1  2023   May            12
1   1  2023  June             4
2   2  2023   May             3

For better precision, it is possible to use hours instead of days:

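import pandas as pd

df[['Start','End']] = df[['Start','End']].apply(pd.to_datetime)

# repeat each row once per whole hour; repeat needs integer counts, hence astype(int)
df = df.loc[df.index.repeat(df['End'].sub(df['Start']).dt.total_seconds().div(3600).astype(int))]

# hour offset 0, 1, 2, ... within each original row
s = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='h')
df['Start'] = df['Start'].add(s)

# count hours per ID/Year/Month and convert back to days
df = (df.assign(Year=df['Start'].dt.year,
                Month=df['Start'].dt.month_name())
        .groupby(['ID', 'Year','Month'], sort=False)
        .size()
        .div(24)
        .reset_index(name='DurationDays'))
print(df)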
   ID  Year Month  DurationDays
0   1  2023   May          11.5
1   1  2023  June           4.5
2   2  2023   May           3.0

Answer 2

Score: 1

@jezrael's approach is on the right track, but it is probably unrealistic for a large DataFrame, as it requires repeating the rows by days/hours (thus creating an intermediate with tens or hundreds of millions of rows).
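As a rough illustration of that size difference, the sketch below counts how many intermediate rows each granularity would create for the three sample rows from the question. The variable names are assumptions made for illustration and are not part of either answer.

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'ID': [1, 2, 1],
    'Start': pd.to_datetime(['2023-05-20 12:00', '2023-05-20 12:00', '2023-06-01 12:00']),
    'End': pd.to_datetime(['2023-06-03 12:00', '2023-05-23 12:00', '2023-06-03 12:00']),
})

span = df['End'] - df['Start']

# one intermediate row per whole day / per whole hour
rows_daily = int(span.dt.days.sum())
rows_hourly = int((span // pd.Timedelta(hours=1)).sum())

# one intermediate row per calendar month touched by the interval
months_crossed = ((df['End'].dt.year - df['Start'].dt.year) * 12
                  + (df['End'].dt.month - df['Start'].dt.month))
rows_monthly = int((months_crossed + 1).sum())

print(rows_daily, rows_hourly, rows_monthly)
```

The gap widens with longer intervals: a row spanning a full year costs about 365 daily rows or 8760 hourly rows, but only 12 or 13 monthly ones.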

Instead, it seems more reasonable to use pandas' Period as the unit for repeating rows.

In short, this splits each row at the period boundaries using repeat, increments the period with groupby.cumcount, gets the boundaries with start_time/end_time, and finally uses groupby.sum to add up the duration per period.

I deliberately added several intermediate variables to make the code easier to follow, but if needed many steps could be combined to avoid creating too many intermediate columns.

import pandas as pd

df[['Start', 'End']] = df[['Start', 'End']].apply(pd.to_datetime)

period = 'M'

out = (df.assign(start=df['Start'].dt.to_period(period),
                 end=df['End'].dt.to_period(period),
                 # number of period boundaries crossed by each row
                 diff=lambda d: d['end'].sub(d['start']).apply(lambda x: x.n))
         # one row per period touched (needs a unique index)
         .loc[lambda d: d.index.repeat(d['diff']+1)]
         .assign(n=lambda d: d.groupby(level=0).cumcount(),
                 # every piece except the first starts at the start of its own period
                 Start=lambda d: d['Start'].mask(d.index.duplicated(), d['start'].add(d['n']).dt.start_time),
                 # every piece except the last ends at the start of the next period
                 End=lambda d: d['End'].mask(d.index.duplicated(keep='last'), d['start'].add(d['n']+1).dt.start_time),
                 DurationDays=lambda d: d['End'].sub(d['Start']),
                 Year=lambda d: d['Start'].dt.year,
                 Month=lambda d: d['Start'].dt.month_name(),
                )
         .groupby(['ID', 'Year', 'Month'], as_index=False, sort=False)
         ['DurationDays'].sum()
      )

Output:

   ID  Year Month     DurationDays
0   1  2023   May 11 days 12:00:00
1   1  2023  June  4 days 12:00:00
2   2  2023   May  3 days 00:00:00

Intermediate before the groupby.sum:

   ID               Start                 End     DurationDays    start      end  diff  n  Year Month
0   1 2023-05-20 12:00:00 2023-06-01 00:00:00 11 days 12:00:00  2023-05  2023-06     1  0  2023   May
0   1 2023-06-01 00:00:00 2023-06-03 12:00:00  2 days 12:00:00  2023-05  2023-06     1  1  2023  June
1   2 2023-05-20 12:00:00 2023-05-23 12:00:00  3 days 00:00:00  2023-05  2023-05     0  0  2023   May
2   1 2023-06-01 12:00:00 2023-06-03 12:00:00  2 days 00:00:00  2023-06  2023-06     0  0  2023  June
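The update in the question mentions intervals that cross several month boundaries (2023-02-20 to 2023-12-18). The following is a minimal, self-contained sketch of the same repeat/cumcount idea for that case. It assumes a unique index and clips each piece to its month instead of masking duplicated rows. The helper name `split_by_month`, the `Duration` column and the 12:00 times in the demo row are hypothetical choices made for illustration.

```python
import pandas as pd

def split_by_month(df):
    """Split each Start/End interval at month boundaries (hypothetical helper)."""
    # number of month boundaries crossed by each row
    n_cross = ((df['End'].dt.year - df['Start'].dt.year) * 12
               + (df['End'].dt.month - df['Start'].dt.month))
    # one row per month touched; requires a unique index on df
    rep = df.loc[df.index.repeat(n_cross + 1)]
    n = rep.groupby(level=0).cumcount().to_numpy()  # 0, 1, 2, ... per original row
    rep = rep.reset_index(drop=True)

    month = rep['Start'].dt.to_period('M') + n      # month covered by each piece
    lo = month.dt.start_time                        # start of that month
    hi = (month + 1).dt.start_time                  # start of the following month

    # clip the original interval to [lo, hi)
    piece_start = rep['Start'].where(rep['Start'] > lo, lo)
    piece_end = rep['End'].where(rep['End'] < hi, hi)

    rep['Year'] = month.dt.year
    rep['Month'] = lo.dt.month_name()
    rep['Duration'] = piece_end - piece_start
    return (rep.groupby(['ID', 'Year', 'Month'], as_index=False, sort=False)
               ['Duration'].sum())

demo = pd.DataFrame({
    'ID': [1],
    'Start': pd.to_datetime(['2023-02-20 12:00']),
    'End': pd.to_datetime(['2023-12-18 12:00']),
})
res = split_by_month(demo)
print(res)
```

One quick check is that the pieces add up to the original interval: here there is one row per month from February to December, and the durations sum to End minus Start. An interval ending exactly at midnight on the first of a month produces a zero-length piece for that month, which can be filtered out if unwanted.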

Generalization

For a more generic method that you can use with any period (provided you have enough resources):

import pandas as pd

df[['Start', 'End']] = df[['Start', 'End']].apply(pd.to_datetime)

period = 'Y'

out = (df.assign(start=df['Start'].dt.to_period(period),
                 end=df['End'].dt.to_period(period),
                 diff=lambda d: d['end'].sub(d['start']).apply(lambda x: x.n))
         .loc[lambda d: d.index.repeat(d['diff']+1)]
         .assign(n=lambda d: d.groupby(level=0).cumcount(),
                 # period covered by each repeated row
                 Period=lambda d: d['start'].add(d['n']),
                 Start=lambda d: d['Start'].mask(d.index.duplicated(), d['Period'].dt.start_time),
                 End=lambda d: d['End'].mask(d.index.duplicated(keep='last'), d['Period'].add(1).dt.start_time),
                 DurationDays=lambda d: d['End'].sub(d['Start']),
                )
         .groupby(['ID', 'Period'], as_index=False, sort=False)
         ['DurationDays'].sum()
      )

Output for period = 'Y':

   ID Period DurationDays
0   1   2023      16 days
1   2   2023       3 days

Output for period = 'M':

   ID   Period     DurationDays
0   1  2023-05 11 days 12:00:00
1   1  2023-06  4 days 12:00:00
2   2  2023-05  3 days 00:00:00

Answer 3

Score: 1

You can reshape your DataFrame with lreshape, then use split_months (slightly adjusted; this helper is not defined on this page):

tmp = (
    pd.lreshape(df, {'Datetime': ['Start', 'End']})
        .sort_values(by=['ID', 'Datetime'])
)

out = (
    tmp.groupby(['ID']).apply(
        lambda g: split_months(g['Datetime'].min(), g['Datetime'].max())
    ).reset_index('ID')
)

Output:

print(out)

   ID Month  Year  DurationDays
1   1   May  2023         11.00
2   1  June  2023          3.00 # <-- needs to be fixed (expected `2+3`)
1   2   May  2023          3.00

huangapple
  • Posted on 2023-06-13 14:57:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76462353.html