Breakdown of time period (duration) to month/year
Question
My initial dataframe looks as follows:
| ID | Start | End | DurationDays |
|---|---|---|---|
| 1 | 2023-05-20 12:00:00.000 | 2023-06-03 12:00:00.000 | 14 |
| 2 | 2023-05-20 12:00:00.000 | 2023-05-23 12:00:00.000 | 3 |
| 1 | 2023-06-01 12:00:00.000 | 2023-06-03 12:00:00.000 | 2 |
How do I get from this to a dataframe like the following? The problem is that there are quite a lot of rows (millions), so performance is quite important.
| ID | Year | Month | DurationDays |
|---|---|---|---|
| 1 | 2023 | May | 11 |
| 2 | 2023 | May | 3 |
| 1 | 2023 | June | 5 |
Update: Please note that a period can span more than one month boundary. Example: from 2023-02-20 (February) to 2023-12-18 (December).
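For anyone who wants to try the answers, the sample table can be rebuilt roughly like this. It is only a sketch: the column names are taken from the table above, and `DurationDays` is recomputed from the timestamps instead of being typed in.

```python
import pandas as pd

# Sample rows copied from the question's first table.
df = pd.DataFrame({
    "ID": [1, 2, 1],
    "Start": ["2023-05-20 12:00:00.000",
              "2023-05-20 12:00:00.000",
              "2023-06-01 12:00:00.000"],
    "End": ["2023-06-03 12:00:00.000",
            "2023-05-23 12:00:00.000",
            "2023-06-03 12:00:00.000"],
})

# Parse the strings into real timestamps.
df[["Start", "End"]] = df[["Start", "End"]].apply(pd.to_datetime)

# Whole days between Start and End, as in the DurationDays column.
df["DurationDays"] = (df["End"] - df["Start"]).dt.days

print(df)
```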
Answer 1
Score: 1
A first idea is to repeat each row once per day between `Start` and `End`, then count the number of rows per `ID`/`Year`/`Month` with `GroupBy.size`:
import pandas as pd

df[['Start', 'End']] = df[['Start', 'End']].apply(pd.to_datetime)
# one repeated row per elapsed day of each span
df = df.loc[df.index.repeat(df['End'].sub(df['Start']).dt.days)]
# day offset of each repeated row within its original row
s = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
df['Date'] = df['Start'].add(s)
df = (df.groupby(['ID', df['Date'].dt.year.rename('Year'),
                  df['Date'].dt.month_name().rename('Month')], sort=False)
        .size()
        .reset_index(name='DurationDays'))
print(df)
ID Year Month DurationDays
0 1 2023 May 12
1 1 2023 June 4
2 2 2023 May 3
For better precision, it is possible to use hours instead of days:
df[['Start', 'End']] = df[['Start', 'End']].apply(pd.to_datetime)
# one repeated row per elapsed hour; repeat() needs integer counts, hence astype(int)
hours = df['End'].sub(df['Start']).dt.total_seconds().div(3600).astype(int)
df = df.loc[df.index.repeat(hours)]
s = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='h')
df['Start'] = df['Start'].add(s)
df = (df.assign(Year=df['Start'].dt.year,
                Month=df['Start'].dt.month_name())
        .groupby(['ID', 'Year', 'Month'], sort=False)
        .size()
        .div(24)
        .reset_index(name='DurationDays'))
print(df)
ID Year Month DurationDays
0 1 2023 May 11.5
1 1 2023 June 4.5
2 2 2023 May 3.0
Answer 2
Score: 1
@jezrael's approach is on the right track, but it is probably unrealistic for a large DataFrame because it repeats the rows by days/hours (thus creating an intermediate with tens or hundreds of millions of rows).
Instead, it seems more reasonable to use pandas' `Period` as the unit for repeating rows.
In short, this splits the rows at their period boundaries using `repeat`, increments the periods with `groupby.cumcount`, gets the boundaries with `start_time`/`end_time`, and finally uses `groupby.sum` to add up the duration per period.
I deliberately added several intermediate variables to make the code easier to understand, but if needed many steps could be combined into one to avoid creating too many intermediate columns.
df[['Start', 'End']] = df[['Start', 'End']].apply(pd.to_datetime)
period = 'M'
out = (df.assign(start=df['Start'].dt.to_period(period),
                 end=df['End'].dt.to_period(period),
                 diff=lambda d: d['end'].sub(d['start']).apply(lambda x: x.n))
         .loc[lambda d: d.index.repeat(d['diff']+1)]
         .assign(n=lambda d: d.groupby(level=0).cumcount(),
                 # every piece but the first starts at the beginning of its own period
                 Start=lambda d: d['Start'].mask(d.index.duplicated(), d['start'].add(d['n']).dt.start_time),
                 # every piece but the last ends where the next period begins
                 # (start + n + 1, so that spans crossing several months split correctly)
                 End=lambda d: d['End'].mask(d.index.duplicated(keep='last'), d['start'].add(d['n']+1).dt.start_time),
                 DurationDays=lambda d: d['End'].sub(d['Start']),
                 Year=lambda d: d['Start'].dt.year,
                 Month=lambda d: d['Start'].dt.month_name(),
                )
         .groupby(['ID', 'Year', 'Month'], as_index=False, sort=False)
         ['DurationDays'].sum()
      )
Output:
ID Year Month DurationDays
0 1 2023 May 11 days 12:00:00
1 1 2023 June 4 days 12:00:00
2 2 2023 May 3 days 00:00:00
Intermediate before the `groupby.sum`:
ID Start End DurationDays start end diff n Year Month
0 1 2023-05-20 12:00:00 2023-06-01 00:00:00 11 days 12:00:00 2023-05 2023-06 1 0 2023 May
0 1 2023-06-01 00:00:00 2023-06-03 12:00:00 2 days 12:00:00 2023-05 2023-06 1 1 2023 June
1 2 2023-05-20 12:00:00 2023-05-23 12:00:00 3 days 00:00:00 2023-05 2023-05 0 0 2023 May
2 1 2023-06-01 12:00:00 2023-06-03 12:00:00 2 days 00:00:00 2023-06 2023-06 0 0 2023 June
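To put the row-count argument in numbers, here is a rough sketch that estimates how many intermediate rows each strategy would create. The variable names (`rows_by_day`, `rows_by_hour`, `rows_by_month`) are made up for this illustration, and the frame is just the question's sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 1],
    "Start": pd.to_datetime(["2023-05-20 12:00", "2023-05-20 12:00", "2023-06-01 12:00"]),
    "End": pd.to_datetime(["2023-06-03 12:00", "2023-05-23 12:00", "2023-06-03 12:00"]),
})

span = df["End"] - df["Start"]

# Repeating by day or by hour creates one row per elapsed day/hour.
rows_by_day = int(span.dt.days.sum())
rows_by_hour = int((span.dt.total_seconds() // 3600).sum())

# Repeating by monthly Period creates one row per calendar month touched.
start_m = df["Start"].dt.to_period("M")
end_m = df["End"].dt.to_period("M")
months_touched = ((end_m.dt.year - start_m.dt.year) * 12
                  + (end_m.dt.month - start_m.dt.month) + 1)
rows_by_month = int(months_touched.sum())

print(rows_by_day, rows_by_hour, rows_by_month)
```

For the three sample rows this gives 19 rows by day, 456 by hour and 4 by month; the 4 matches the number of rows in the intermediate table above. How large the gap is on real data depends on how long the spans typically are.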
Generalization
For a more generic method that you can use with any period (provided you have enough resources):
df[['Start', 'End']] = df[['Start', 'End']].apply(pd.to_datetime)
period = 'Y'
out = (df.assign(Period=df['Start'].dt.to_period(period),
                 end=df['End'].dt.to_period(period),
                 diff=lambda d: d['end'].sub(d['Period']).apply(lambda x: x.n))
         .loc[lambda d: d.index.repeat(d['diff']+1)]
         .assign(n=lambda d: d.groupby(level=0).cumcount(),
                 # move each repeated row to the period it actually covers,
                 # otherwise every piece is grouped under the starting period
                 Period=lambda d: d['Period'].add(d['n']),
                 Start=lambda d: d['Start'].mask(d.index.duplicated(), d['Period'].dt.start_time),
                 End=lambda d: d['End'].mask(d.index.duplicated(keep='last'), d['Period'].add(1).dt.start_time),
                 DurationDays=lambda d: d['End'].sub(d['Start']),
                )
         .groupby(['ID', 'Period'], as_index=False, sort=False)
         ['DurationDays'].sum()
      )
Output for `period = 'Y'`:
ID Period DurationDays
0 1 2023 16 days
1 2 2023 3 days
Output for `period = 'M'`:
ID Period DurationDays
0 1 2023-05 11 days 12:00:00
1 1 2023-06 4 days 12:00:00
2 2 2023-05 3 days 00:00:00
Answer 3
Score: 1
You can `lreshape` your DataFrame and then use `split_months` (slightly adjusted):
tmp = (
    pd.lreshape(df, {'Datetime': ['Start', 'End']})
      .sort_values(by=['ID', 'Datetime'])
)
out = (
    tmp.groupby(['ID']).apply(
        lambda g: split_months(g['Datetime'].min(), g['Datetime'].max())
    ).reset_index('ID')
)
Output :
print(out)
ID Month Year DurationDays
1 1 May 2023 11.00
2 1 June 2023 3.00 # <-- needs to be fixed (it is `2+3`)
1 2 May 2023 3.00