在主数据框中查找子数据框的索引并匹配

huangapple go评论68阅读模式
英文:

Finding the index of a sub dataframe and match in the main dataframe

问题

我有一个如下的数据框:

呼叫ID 存储日期 关闭日期 时间差
1 2023-02-08 14:35:09 2023-02-08 14:35:56
1 2023-02-08 14:35:56 2023-02-08 14:42:00
2 2023-02-07 10:17:18 2023-02-07 10:22:23
2 2023-02-07 10:22:23 2023-02-07 15:09:14
2 2023-02-07 15:09:14 2023-02-07 16:20:50
2 2023-02-07 16:20:49 2023-02-08 09:23:16
2 2023-02-08 09:23:16 2023-02-08 09:27:21
3 2023-03-10 10:31:25 2023-03-10 10:41:37
3 2023-03-10 10:41:37 2023-03-10 14:23:18

为了得到时间差,我正在执行以下操作:

delta_time = a.iloc[-1]['CloseDate'] - a.iloc[0]['StorageDate']

我需要减去每个呼叫ID的最后一个CloseDate和第一个StorageDate(总共16821次),delta_time必须放在每个呼叫ID的最后一行,其中有值(与我从中获取CloseDate的方式相同)。

我正在这样做:

callid = 1
while callid <= 16821:
    df1 = df1[df1['CallID'] == callid]
    delta_time = df1.iloc[-1]['CloseDate'] - df1.iloc[0]['StorageDate']
    callid += 1

但问题是我无法正确解析delta_time的值。

在之前的尝试中,我使用loc和iloc进行了尝试,并成功将其发送到df1中的正确行,结构如下:

delta_time = df1.iloc[-1]['CloseDate'] - df1.iloc[0]['StorageDate']
df1.loc[1, 'Time Delta'] = delta_time

它可以工作,但效率不高,因为我必须为每个不同的CallID更改loc中的值,而iloc[-1]似乎无法工作。此外,我不知道如何解析它到主数据框而不仅仅是我创建的用于执行数学运算的数据框。

有人可以在这方面帮助我吗?

英文:

I have a dataframe as below:

CallID StorageDate CloseDate Time Delta
1 2023-02-08 14:35:09 2023-02-08 14:35:56
1 2023-02-08 14:35:56 2023-02-08 14:42:00 value
2 2023-02-07 10:17:18 2023-02-07 10:22:23
2 2023-02-07 10:22:23 2023-02-07 15:09:14
2 2023-02-07 15:09:14 2023-02-07 16:20:50
2 2023-02-07 16:20:49 2023-02-08 09:23:16
2 2023-02-08 09:23:16 2023-02-08 09:27:21 value
3 2023-03-10 10:31:25 2023-03-10 10:41:37
3 2023-03-10 10:41:37 2023-03-10 14:23:18 value

To achieve the Time Delta, I am doing the following:

delta_time = a.iloc[-1][&#39;CloseDate&#39;] - a.iloc[0][&#39;StorageDate&#39;]

I need to subtract the last CloseDate from the first StorageDate for each CallID (a total of 16821), and the delta_time must go in the last row of each CallID, where there is value (the same as I get the CloseDate from).

I'm doing as follows:

callid = 1
while callid &lt;= 16821:
    df1 = df1[df1[&#39;CallID&#39;] == callid]
    delta_time = df1.iloc[-1][&#39;CloseDate&#39;] - df1.iloc[0][&#39;StorageDate&#39;]
    callid += 1

But the problem is that I'm not being abble to parse the delta_time value to the correct row.

Before I tried doing with loc and iloc, and I managed to send it to the correct row in df1 with the following structure:

 delta_time = df1.iloc[-1][&#39;CloseDate&#39;] - df1.iloc[0][&#39;StorageDate&#39;]
 df1.loc[1, &#39;Time Delta&#39;] = delta_time

It works, but it's unefficient since I have to change the value inside the loc for every different CallID and iloc[-1] doesn't seem to work. Moreover, I don't know how to parse it to the main dataframe and not only the one I created to do the math.

Can anybody help me here?

答案1

得分: 2

df[[&#39;StorageDate&#39;, &#39;CloseDate&#39;]] = df[[&#39;StorageDate&#39;, &#39;CloseDate&#39;]].apply(pd.to_datetime)

g = df.groupby(&#39;CallID&#39;)

df[&#39;Time Delta&#39;] = (g[&#39;CloseDate&#39;].transform(&#39;last&#39;)
                    .sub(g[&#39;StorageDate&#39;].transform(&#39;first&#39;))
                    .where(~df[&#39;CallID&#39;].duplicated(keep=&#39;last&#39;))
                   )
英文:

Use groupby.transform and where:

df[[&#39;StorageDate&#39;, &#39;CloseDate&#39;]] = df[[&#39;StorageDate&#39;, &#39;CloseDate&#39;]].apply(pd.to_datetime)

g = df.groupby(&#39;CallID&#39;)

df[&#39;Time Delta&#39;] = (g[&#39;CloseDate&#39;].transform(&#39;last&#39;)
                    .sub(g[&#39;StorageDate&#39;].transform(&#39;first&#39;))
                    .where(~df[&#39;CallID&#39;].duplicated(keep=&#39;last&#39;))
                   )

Output:

   CallID         StorageDate           CloseDate      Time Delta
0       1 2023-02-08 14:35:09 2023-02-08 14:35:56             NaT
1       1 2023-02-08 14:35:56 2023-02-08 14:42:00 0 days 00:06:51
2       2 2023-02-07 10:17:18 2023-02-07 10:22:23             NaT
3       2 2023-02-07 10:22:23 2023-02-07 15:09:14             NaT
4       2 2023-02-07 15:09:14 2023-02-07 16:20:50             NaT
5       2 2023-02-07 16:20:49 2023-02-08 09:23:16             NaT
6       2 2023-02-08 09:23:16 2023-02-08 09:27:21 0 days 23:10:03
7       3 2023-03-10 10:31:25 2023-03-10 10:41:37             NaT
8       3 2023-03-10 10:41:37 2023-03-10 14:23:18 0 days 03:51:53

Reproducible input:

df = pd.DataFrame({&#39;CallID&#39;: [1, 1, 2, 2, 2, 2, 2, 3, 3],
                   &#39;StorageDate&#39;: [&#39;2023-02-08 14:35:09&#39;, &#39;2023-02-08 14:35:56&#39;, &#39;2023-02-07 10:17:18&#39;, &#39;2023-02-07 10:22:23&#39;, &#39;2023-02-07 15:09:14&#39;, &#39;2023-02-07 16:20:49&#39;, &#39;2023-02-08 09:23:16&#39;, &#39;2023-03-10 10:31:25&#39;, &#39;2023-03-10 10:41:37&#39;],
                   &#39;CloseDate&#39;: [&#39;2023-02-08 14:35:56&#39;, &#39;2023-02-08 14:42:00&#39;, &#39;2023-02-07 10:22:23&#39;, &#39;2023-02-07 15:09:14&#39;, &#39;2023-02-07 16:20:50&#39;, &#39;2023-02-08 09:23:16&#39;, &#39;2023-02-08 09:27:21&#39;, &#39;2023-03-10 10:41:37&#39;, &#39;2023-03-10 14:23:18&#39;]})

df[[&#39;StorageDate&#39;, &#39;CloseDate&#39;]] = df[[&#39;StorageDate&#39;, &#39;CloseDate&#39;]].apply(pd.to_datetime)

答案2

得分: 1

使用Series.duplicated来过滤由GroupBy.transform生成的最后几行:

m = ~df['CallID'].duplicated(keep='last')

g = df.groupby('CallID')

df.loc[m, 'Time Delta'] = (g['CloseDate'].transform('last')[m]
                                         .sub(g['StorageDate'].transform('first')[m]))
print (df)
   CallID         StorageDate           CloseDate       Time Delta
0       1 2023-02-08 14:35:09 2023-02-08 14:35:56              NaN
1       1 2023-02-08 14:35:56 2023-02-08 14:42:00  0 days 00:06:51
2       2 2023-02-07 10:17:18 2023-02-07 10:22:23              NaN
3       2 2023-02-07 10:22:23 2023-02-07 15:09:14              NaN
4       2 2023-02-07 15:09:14 2023-02-07 16:20:50              NaN
5       2 2023-02-07 16:20:49 2023-02-08 09:23:16              NaN
6       2 2023-02-08 09:23:16 2023-02-08 09:27:21  0 days 23:10:03
7       3 2023-03-10 10:31:25 2023-03-10 10:41:37              NaN
8       3 2023-03-10 10:41:37 2023-03-10 14:23:18  0 days 03:51:53

另一种使用GroupBy.agg和映射差异的解决方案:

def f(x):
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return ('{:02d}:{:02d}:{:02d}').format(int(hours), int(minutes), int(seconds))

m = ~df['CallID'].duplicated(keep='last')

df1 = df.groupby('CallID').agg({'CloseDate':'last', 'StorageDate':'first'})

df.loc[m, 'Time Delta'] = (df.loc[m, 'CallID'].map(df1['CloseDate'].sub(df1['StorageDate']))
                             .apply(f))
print (df)
   CallID         StorageDate           CloseDate Time Delta
0       1 2023-02-08 14:35:09 2023-02-08 14:35:56        NaN
1       1 2023-02-08 14:35:56 2023-02-08 14:42:00   00:06:51
2       2 2023-02-07 10:17:18 2023-02-07 10:22:23        NaN
3       2 2023-02-07 10:22:23 2023-02-07 15:09:14        NaN
4       2 2023-02-07 15:09:14 2023-02-07 16:20:50        NaN
5       2 2023-02-07 16:20:49 2023-02-08 09:23:16        NaN
6       2 2023-02-08 09:23:16 2023-02-08 09:27:21   23:10:03
7       3 2023-03-10 10:31:25 2023-03-10 10:41:37        NaN
8       3 2023-03-10 10:41:37 2023-03-10 14:23:18   03:51:53
英文:

Use Series.duplicated for filter last rows generated by GroupBy.transform:

m = ~df[&#39;CallID&#39;].duplicated(keep=&#39;last&#39;)
g = df.groupby(&#39;CallID&#39;)
df.loc[m, &#39;Time Delta&#39;] = (g[&#39;CloseDate&#39;].transform(&#39;last&#39;)[m]
.sub(g[&#39;StorageDate&#39;].transform(&#39;first&#39;)[m]))
print (df)
CallID         StorageDate           CloseDate       Time Delta
0       1 2023-02-08 14:35:09 2023-02-08 14:35:56              NaN
1       1 2023-02-08 14:35:56 2023-02-08 14:42:00  0 days 00:06:51
2       2 2023-02-07 10:17:18 2023-02-07 10:22:23              NaN
3       2 2023-02-07 10:22:23 2023-02-07 15:09:14              NaN
4       2 2023-02-07 15:09:14 2023-02-07 16:20:50              NaN
5       2 2023-02-07 16:20:49 2023-02-08 09:23:16              NaN
6       2 2023-02-08 09:23:16 2023-02-08 09:27:21  0 days 23:10:03
7       3 2023-03-10 10:31:25 2023-03-10 10:41:37              NaN
8       3 2023-03-10 10:41:37 2023-03-10 14:23:18  0 days 03:51:53

Another solution with aggregate by GroupBy.agg with mapping difference:

def f(x):
ts = x.total_seconds()
hours, remainder = divmod(ts, 3600)
minutes, seconds = divmod(remainder, 60)
return (&#39;{:02d}:{:02d}:{:02d}&#39;).format(int(hours), int(minutes), int(seconds))
m = ~df[&#39;CallID&#39;].duplicated(keep=&#39;last&#39;)
df1 = df.groupby(&#39;CallID&#39;).agg({&#39;CloseDate&#39;:&#39;last&#39;, &#39;StorageDate&#39;:&#39;first&#39;})
df.loc[m, &#39;Time Delta&#39;] = (df.loc[m, &#39;CallID&#39;].map(df1[&#39;CloseDate&#39;].sub(df1[&#39;StorageDate&#39;]))
.apply(f))
print (df)
CallID         StorageDate           CloseDate Time Delta
0       1 2023-02-08 14:35:09 2023-02-08 14:35:56        NaN
1       1 2023-02-08 14:35:56 2023-02-08 14:42:00   00:06:51
2       2 2023-02-07 10:17:18 2023-02-07 10:22:23        NaN
3       2 2023-02-07 10:22:23 2023-02-07 15:09:14        NaN
4       2 2023-02-07 15:09:14 2023-02-07 16:20:50        NaN
5       2 2023-02-07 16:20:49 2023-02-08 09:23:16        NaN
6       2 2023-02-08 09:23:16 2023-02-08 09:27:21   23:10:03
7       3 2023-03-10 10:31:25 2023-03-10 10:41:37        NaN
8       3 2023-03-10 10:41:37 2023-03-10 14:23:18   03:51:53

huangapple
  • 本文由 发表于 2023年3月15日 21:01:44
  • 转载请务必保留本文链接:https://go.coder-hub.com/75745095.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定