How to generate pandas dataframe timedelta column grouped by id and date (YYYY-MM-DD)?
Question
Suppose I have a dataframe with id and datetime columns:
import pandas as pd

df = pd.DataFrame({"id": ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2", "a3", "a3", "a3", "a3"],
                   "datetime": ["2016-01-01 00:01:00.156",
                                "2016-01-01 12:00:00.425",
                                "2016-01-02 00:59:00.123",
                                "2016-01-02 14:16:00.548",
                                "2016-01-01 12:00:00.147",
                                "2016-01-01 13:59:00.123",
                                "2016-01-02 08:01:00.147",
                                "2016-01-02 18:49:00.123",
                                "2016-02-01 12:00:00.147",
                                "2016-02-01 13:59:00.123",
                                "2016-02-02 08:01:00.147",
                                "2016-02-02 18:49:00.123"]})
# Parse the strings into datetime64 values
df["datetime"] = pd.to_datetime(df["datetime"])
df
Here is the dataframe:
    id                datetime
0   a1 2016-01-01 00:01:00.156
1   a1 2016-01-01 12:00:00.425
2   a1 2016-01-02 00:59:00.123
3   a1 2016-01-02 14:16:00.548
4   a2 2016-01-01 12:00:00.147
5   a2 2016-01-01 13:59:00.123
6   a2 2016-01-02 08:01:00.147
7   a2 2016-01-02 18:49:00.123
8   a3 2016-02-01 12:00:00.147
9   a3 2016-02-01 13:59:00.123
10  a3 2016-02-02 08:01:00.147
11  a3 2016-02-02 18:49:00.123
I want to generate a timedelta column that holds the elapsed time in minutes. This is the output I expect to get:
    id                datetime       datetime_baseline  timedelta
0   a1 2016-01-01 00:01:00.156 2016-01-01 00:01:00.156          0
1   a1 2016-01-01 12:00:00.425 2016-01-01 00:01:00.156        719
2   a1 2016-01-02 00:59:00.123 2016-01-02 00:59:00.123          0
3   a1 2016-01-02 14:16:00.548 2016-01-02 00:59:00.123        797
4   a2 2016-01-01 12:00:00.147 2016-01-01 12:00:00.147          0
5   a2 2016-01-01 13:59:00.123 2016-01-01 12:00:00.147        119
6   a2 2016-01-02 08:01:00.147 2016-01-02 08:01:00.147          0
7   a2 2016-01-02 18:49:00.123 2016-01-02 08:01:00.147        648
8   a3 2016-02-01 12:00:00.147 2016-02-01 12:00:00.147          0
9   a3 2016-02-01 13:59:00.123 2016-02-01 12:00:00.147        119
10  a3 2016-02-02 08:01:00.147 2016-02-02 08:01:00.147          0
11  a3 2016-02-02 18:49:00.123 2016-02-02 08:01:00.147        648
Here is how the timedelta values should be calculated: 1) the code needs to identify the FIRST datetime within the same id and date ('YYYY-MM-DD'), and 2) use it as the baseline (datetime_baseline) to compute the timedelta (in minutes) relative to the other datetimes within the same id and same date. For id='a1' and date='2016-01-01', datetime_baseline='2016-01-01 00:01:00.156'. So, at index=0, timedelta has value 0 because '2016-01-01 00:01:00.156' - datetime_baseline = 0. Meanwhile, at index=1, timedelta has value 719 because '2016-01-01 12:00:00.425' - datetime_baseline = 719 (minutes). At index=2, id is the same as before but the date is now '2016-01-02', so a new baseline is used: '2016-01-02 00:59:00.123', and timedelta = '2016-01-02 00:59:00.123' - datetime_baseline = 0. At index=3, timedelta = '2016-01-02 14:16:00.548' - datetime_baseline = 797.
Although I see how the timedelta values should be calculated (timedelta = datetime - datetime_baseline), I don't know how to identify the baseline values (i.e. how to generate the datetime_baseline column). Please let me know if you need any further explanation.
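As a quick sanity check of the arithmetic described above, the index=1 and index=3 values can be reproduced by hand. This is only a sketch: the variable names are illustrative, and rounding to the nearest minute is an assumption inferred from the expected output.

```python
import pandas as pd

# Baseline and later timestamp for id='a1' on 2016-01-01 (index 0 and 1)
baseline_day1 = pd.Timestamp("2016-01-01 00:01:00.156")
later_day1 = pd.Timestamp("2016-01-01 12:00:00.425")

# Baseline and later timestamp for id='a1' on 2016-01-02 (index 2 and 3)
baseline_day2 = pd.Timestamp("2016-01-02 00:59:00.123")
later_day2 = pd.Timestamp("2016-01-02 14:16:00.548")

# Subtracting two Timestamps gives a Timedelta; convert it to whole minutes
minutes_idx1 = round((later_day1 - baseline_day1).total_seconds() / 60)
minutes_idx3 = round((later_day2 - baseline_day2).total_seconds() / 60)

print(minutes_idx1, minutes_idx3)
```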
P.S. The actual dataframe has more than 500 thousand rows.
Answer 1
Score: 2
Use GroupBy.transform to build the baseline:
df["datetime_baseline"] = (df.groupby(["id", df["datetime"].dt.date])
                             ["datetime"].transform("first"))
And dt.total_seconds to compute the timedelta:
df["timedelta"] = ((df["datetime"].sub(df["datetime_baseline"]))
                   .dt.total_seconds().div(60).round(0).astype(int))
Output:
print(df)
    id                datetime       datetime_baseline  timedelta
0   a1 2016-01-01 00:01:00.156 2016-01-01 00:01:00.156          0
1   a1 2016-01-01 12:00:00.425 2016-01-01 00:01:00.156        719
2   a1 2016-01-02 00:59:00.123 2016-01-02 00:59:00.123          0
3   a1 2016-01-02 14:16:00.548 2016-01-02 00:59:00.123        797
4   a2 2016-01-01 12:00:00.147 2016-01-01 12:00:00.147          0
5   a2 2016-01-01 13:59:00.123 2016-01-01 12:00:00.147        119
6   a2 2016-01-02 08:01:00.147 2016-01-02 08:01:00.147          0
7   a2 2016-01-02 18:49:00.123 2016-01-02 08:01:00.147        648
8   a3 2016-02-01 12:00:00.147 2016-02-01 12:00:00.147          0
9   a3 2016-02-01 13:59:00.123 2016-02-01 12:00:00.147        119
10  a3 2016-02-02 08:01:00.147 2016-02-02 08:01:00.147          0
11  a3 2016-02-02 18:49:00.123 2016-02-02 08:01:00.147        648
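Since the question mentions more than 500 thousand rows, one possible variation is to group on dt.normalize() instead of dt.date. The former keeps the key as datetime64 (midnight of each day), while dt.date produces an object column of Python date objects; this may be faster on large frames, though I have not benchmarked it. The sketch below is an assumption-laden alternative, not part of the original answer, and it uses only the a1 rows for brevity; the name day is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a1", "a1", "a1", "a1"],
    "datetime": pd.to_datetime([
        "2016-01-01 00:01:00.156",
        "2016-01-01 12:00:00.425",
        "2016-01-02 00:59:00.123",
        "2016-01-02 14:16:00.548",
    ]),
})

# Calendar day as a datetime64 key (time part set to 00:00:00)
day = df["datetime"].dt.normalize().rename("day")

# First datetime (in row order) within each (id, day) group
df["datetime_baseline"] = df.groupby(["id", day])["datetime"].transform("first")

# Difference in minutes, rounded to the nearest whole minute
df["timedelta"] = ((df["datetime"] - df["datetime_baseline"])
                   .dt.total_seconds().div(60).round().astype(int))

print(df)
```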
Answer 2
Score: 1
Try:
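import numpy as np
# Baseline = earliest datetime within each (id, calendar date) group
df['datetime_baseline'] = df.groupby(['id', df['datetime'].dt.date])["datetime"].transform('min')
# .dt.seconds drops whole days; fine here because each group stays within one date
df['timedelta'] = np.round((df['datetime'] - df['datetime_baseline']).dt.seconds / 60)
print(df)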
Prints:
    id                datetime       datetime_baseline  timedelta
0   a1 2016-01-01 00:01:00.156 2016-01-01 00:01:00.156        0.0
1   a1 2016-01-01 12:00:00.425 2016-01-01 00:01:00.156      719.0
2   a1 2016-01-02 00:59:00.123 2016-01-02 00:59:00.123        0.0
3   a1 2016-01-02 14:16:00.548 2016-01-02 00:59:00.123      797.0
4   a2 2016-01-01 12:00:00.147 2016-01-01 12:00:00.147        0.0
5   a2 2016-01-01 13:59:00.123 2016-01-01 12:00:00.147      119.0
6   a2 2016-01-02 08:01:00.147 2016-01-02 08:01:00.147        0.0
7   a2 2016-01-02 18:49:00.123 2016-01-02 08:01:00.147      648.0
8   a3 2016-02-01 12:00:00.147 2016-02-01 12:00:00.147        0.0
9   a3 2016-02-01 13:59:00.123 2016-02-01 12:00:00.147      119.0
10  a3 2016-02-02 08:01:00.147 2016-02-02 08:01:00.147        0.0
11  a3 2016-02-02 18:49:00.123 2016-02-02 08:01:00.147      648.0
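One difference between the two answers is worth spelling out: transform("first") takes the first row in the frame's current order, whereas transform("min") takes the earliest timestamp regardless of order. They agree on the sample data because it is already sorted. The sketch below uses a deliberately unsorted toy frame (my own, not from the question) to show where they diverge.

```python
import pandas as pd

# Toy frame where the later timestamp of 2016-01-01 appears first
df = pd.DataFrame({
    "id": ["a1", "a1", "a1"],
    "datetime": pd.to_datetime([
        "2016-01-01 12:00:00.425",
        "2016-01-01 00:01:00.156",
        "2016-01-02 00:59:00.123",
    ]),
})

grouped = df.groupby(["id", df["datetime"].dt.date])["datetime"]

# "first" follows row order; "min" follows chronological order
first_baseline = grouped.transform("first")
min_baseline = grouped.transform("min")

print(pd.DataFrame({"first": first_baseline, "min": min_baseline}))
```

If the real data is not guaranteed to be sorted by datetime, either sort it first or prefer "min".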