Finding the index of a sub dataframe and match in the main dataframe
Question
I have a dataframe as below:
CallID | StorageDate | CloseDate | Time Delta |
---|---|---|---|
1 | 2023-02-08 14:35:09 | 2023-02-08 14:35:56 | |
1 | 2023-02-08 14:35:56 | 2023-02-08 14:42:00 | value |
2 | 2023-02-07 10:17:18 | 2023-02-07 10:22:23 | |
2 | 2023-02-07 10:22:23 | 2023-02-07 15:09:14 | |
2 | 2023-02-07 15:09:14 | 2023-02-07 16:20:50 | |
2 | 2023-02-07 16:20:49 | 2023-02-08 09:23:16 | |
2 | 2023-02-08 09:23:16 | 2023-02-08 09:27:21 | value |
3 | 2023-03-10 10:31:25 | 2023-03-10 10:41:37 | |
3 | 2023-03-10 10:41:37 | 2023-03-10 14:23:18 | value |
To achieve the Time Delta, I am doing the following:
delta_time = a.iloc[-1]['CloseDate'] - a.iloc[0]['StorageDate']
I need to subtract the first StorageDate from the last CloseDate for each CallID (16821 in total), and the delta_time must go in the last row of each CallID, the one marked value (the same row the CloseDate comes from).
I'm doing as follows:
callid = 1
while callid <= 16821:
    df1 = df1[df1['CallID'] == callid]
    delta_time = df1.iloc[-1]['CloseDate'] - df1.iloc[0]['StorageDate']
    callid += 1
But the problem is that I'm not able to write the delta_time value to the correct row.
Earlier I tried with loc and iloc, and I managed to send it to the correct row of df1 with the following structure:
delta_time = df1.iloc[-1]['CloseDate'] - df1.iloc[0]['StorageDate']
df1.loc[1, 'Time Delta'] = delta_time
It works, but it's inefficient since I have to change the value inside loc for every different CallID, and iloc[-1] doesn't seem to work there. Moreover, I don't know how to write the result back to the main dataframe rather than only the sub-dataframe I created to do the math.
Can anybody help me here?
Answer 1
Score: 2
Use groupby.transform and where:
df[['StorageDate', 'CloseDate']] = df[['StorageDate', 'CloseDate']].apply(pd.to_datetime)

g = df.groupby('CallID')

df['Time Delta'] = (g['CloseDate'].transform('last')
                     .sub(g['StorageDate'].transform('first'))
                     .where(~df['CallID'].duplicated(keep='last'))
                    )
Output:
CallID StorageDate CloseDate Time Delta
0 1 2023-02-08 14:35:09 2023-02-08 14:35:56 NaT
1 1 2023-02-08 14:35:56 2023-02-08 14:42:00 0 days 00:06:51
2 2 2023-02-07 10:17:18 2023-02-07 10:22:23 NaT
3 2 2023-02-07 10:22:23 2023-02-07 15:09:14 NaT
4 2 2023-02-07 15:09:14 2023-02-07 16:20:50 NaT
5 2 2023-02-07 16:20:49 2023-02-08 09:23:16 NaT
6 2 2023-02-08 09:23:16 2023-02-08 09:27:21 0 days 23:10:03
7 3 2023-03-10 10:31:25 2023-03-10 10:41:37 NaT
8 3 2023-03-10 10:41:37 2023-03-10 14:23:18 0 days 03:51:53
Reproducible input:
df = pd.DataFrame({'CallID': [1, 1, 2, 2, 2, 2, 2, 3, 3],
                   'StorageDate': ['2023-02-08 14:35:09', '2023-02-08 14:35:56', '2023-02-07 10:17:18', '2023-02-07 10:22:23', '2023-02-07 15:09:14', '2023-02-07 16:20:49', '2023-02-08 09:23:16', '2023-03-10 10:31:25', '2023-03-10 10:41:37'],
                   'CloseDate': ['2023-02-08 14:35:56', '2023-02-08 14:42:00', '2023-02-07 10:22:23', '2023-02-07 15:09:14', '2023-02-07 16:20:50', '2023-02-08 09:23:16', '2023-02-08 09:27:21', '2023-03-10 10:41:37', '2023-03-10 14:23:18']})

df[['StorageDate', 'CloseDate']] = df[['StorageDate', 'CloseDate']].apply(pd.to_datetime)
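The where step above hinges on the boolean mask built by duplicated(keep='last'). A minimal, self-contained sketch (using a hypothetical toy Series, not the question's data) of how that mask selects only each group's final row:

```python
import pandas as pd

# duplicated(keep='last') flags every occurrence of a value except its
# last one; negating it therefore keeps only each CallID's final row.
call_ids = pd.Series([1, 1, 2, 2, 2, 3])  # toy data mirroring the CallID column
mask = ~call_ids.duplicated(keep='last')
print(mask.tolist())  # [False, True, False, False, True, True]
```

Rows where the mask is False are left as NaT by where, which is exactly the blank-then-value layout the question asks for.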
Answer 2
Score: 1
Use Series.duplicated to filter the last row of each group produced by GroupBy.transform:
m = ~df['CallID'].duplicated(keep='last')

g = df.groupby('CallID')
df.loc[m, 'Time Delta'] = (g['CloseDate'].transform('last')[m]
                           .sub(g['StorageDate'].transform('first')[m]))
print (df)
CallID StorageDate CloseDate Time Delta
0 1 2023-02-08 14:35:09 2023-02-08 14:35:56 NaN
1 1 2023-02-08 14:35:56 2023-02-08 14:42:00 0 days 00:06:51
2 2 2023-02-07 10:17:18 2023-02-07 10:22:23 NaN
3 2 2023-02-07 10:22:23 2023-02-07 15:09:14 NaN
4 2 2023-02-07 15:09:14 2023-02-07 16:20:50 NaN
5 2 2023-02-07 16:20:49 2023-02-08 09:23:16 NaN
6 2 2023-02-08 09:23:16 2023-02-08 09:27:21 0 days 23:10:03
7 3 2023-03-10 10:31:25 2023-03-10 10:41:37 NaN
8 3 2023-03-10 10:41:37 2023-03-10 14:23:18 0 days 03:51:53
Another solution: aggregate with GroupBy.agg and map the difference back:
def f(x):
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return '{:02d}:{:02d}:{:02d}'.format(int(hours), int(minutes), int(seconds))

m = ~df['CallID'].duplicated(keep='last')
df1 = df.groupby('CallID').agg({'CloseDate': 'last', 'StorageDate': 'first'})
df.loc[m, 'Time Delta'] = (df.loc[m, 'CallID'].map(df1['CloseDate'].sub(df1['StorageDate']))
                           .apply(f))
print (df)
CallID StorageDate CloseDate Time Delta
0 1 2023-02-08 14:35:09 2023-02-08 14:35:56 NaN
1 1 2023-02-08 14:35:56 2023-02-08 14:42:00 00:06:51
2 2 2023-02-07 10:17:18 2023-02-07 10:22:23 NaN
3 2 2023-02-07 10:22:23 2023-02-07 15:09:14 NaN
4 2 2023-02-07 15:09:14 2023-02-07 16:20:50 NaN
5 2 2023-02-07 16:20:49 2023-02-08 09:23:16 NaN
6 2 2023-02-08 09:23:16 2023-02-08 09:27:21 23:10:03
7 3 2023-03-10 10:31:25 2023-03-10 10:41:37 NaN
8 3 2023-03-10 10:41:37 2023-03-10 14:23:18 03:51:53
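The helper f is needed because str() on a Timedelta longer than a day renders a days component (e.g. '1 days 02:03:04'); folding total_seconds into the hours field yields a plain HH:MM:SS string. A self-contained sketch of the same logic (the function name fmt is illustrative):

```python
import pandas as pd

def fmt(td):
    # Same approach as f above: convert the whole span to seconds,
    # then fold days into the hours field so output stays HH:MM:SS.
    ts = td.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return '{:02d}:{:02d}:{:02d}'.format(int(hours), int(minutes), int(seconds))

print(fmt(pd.Timedelta(days=1, hours=2, minutes=3, seconds=4)))  # 26:03:04
```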