Finding the index of a sub-dataframe and matching it in the main dataframe

Question

I have a dataframe as below:

CallID  StorageDate          CloseDate            Time Delta
1       2023-02-08 14:35:09  2023-02-08 14:35:56
1       2023-02-08 14:35:56  2023-02-08 14:42:00  value
2       2023-02-07 10:17:18  2023-02-07 10:22:23
2       2023-02-07 10:22:23  2023-02-07 15:09:14
2       2023-02-07 15:09:14  2023-02-07 16:20:50
2       2023-02-07 16:20:49  2023-02-08 09:23:16
2       2023-02-08 09:23:16  2023-02-08 09:27:21  value
3       2023-03-10 10:31:25  2023-03-10 10:41:37
3       2023-03-10 10:41:37  2023-03-10 14:23:18  value

To compute the Time Delta, I am doing the following:

delta_time = a.iloc[-1]['CloseDate'] - a.iloc[0]['StorageDate']

I need to subtract the first StorageDate from the last CloseDate for each CallID (16821 in total), and the delta_time must go in the last row of each CallID, the row marked value (the same row the CloseDate comes from).

I'm doing as follows:

callid = 1
while callid <= 16821:
    df1 = df1[df1['CallID'] == callid]
    delta_time = df1.iloc[-1]['CloseDate'] - df1.iloc[0]['StorageDate']
    callid += 1

But the problem is that I'm not able to write the delta_time value to the correct row.

Earlier I tried with loc and iloc, and I managed to send it to the correct row in df1 with the following structure:

delta_time = df1.iloc[-1]['CloseDate'] - df1.iloc[0]['StorageDate']
df1.loc[1, 'Time Delta'] = delta_time

It works, but it's inefficient since I have to change the value inside loc for every different CallID, and iloc[-1] doesn't seem to work there. Moreover, I don't know how to write the result back to the main dataframe rather than only the sub-dataframe I created to do the math.

Can anybody help me here?

Answer 1 (score: 2)

Use groupby.transform and where:

df[['StorageDate', 'CloseDate']] = df[['StorageDate', 'CloseDate']].apply(pd.to_datetime)

g = df.groupby('CallID')

df['Time Delta'] = (g['CloseDate'].transform('last')
                    .sub(g['StorageDate'].transform('first'))
                    .where(~df['CallID'].duplicated(keep='last'))
                   )

Output:

   CallID         StorageDate           CloseDate      Time Delta
0       1 2023-02-08 14:35:09 2023-02-08 14:35:56             NaT
1       1 2023-02-08 14:35:56 2023-02-08 14:42:00 0 days 00:06:51
2       2 2023-02-07 10:17:18 2023-02-07 10:22:23             NaT
3       2 2023-02-07 10:22:23 2023-02-07 15:09:14             NaT
4       2 2023-02-07 15:09:14 2023-02-07 16:20:50             NaT
5       2 2023-02-07 16:20:49 2023-02-08 09:23:16             NaT
6       2 2023-02-08 09:23:16 2023-02-08 09:27:21 0 days 23:10:03
7       3 2023-03-10 10:31:25 2023-03-10 10:41:37             NaT
8       3 2023-03-10 10:41:37 2023-03-10 14:23:18 0 days 03:51:53
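The mask passed to where above can be sketched in isolation. duplicated(keep='last') flags every row except the last occurrence of each value, so its negation marks exactly one row per CallID (a minimal illustration with made-up data):

```python
import pandas as pd

# duplicated(keep='last') is True for all but the last occurrence of each
# value, so negating it keeps exactly one row per CallID (the last one).
s = pd.Series([1, 1, 2, 2, 2, 3], name='CallID')
mask = ~s.duplicated(keep='last')
print(mask.tolist())  # [False, True, False, False, True, True]
```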

Reproducible input:

df = pd.DataFrame({'CallID': [1, 1, 2, 2, 2, 2, 2, 3, 3],
                   'StorageDate': ['2023-02-08 14:35:09', '2023-02-08 14:35:56', '2023-02-07 10:17:18', '2023-02-07 10:22:23', '2023-02-07 15:09:14', '2023-02-07 16:20:49', '2023-02-08 09:23:16', '2023-03-10 10:31:25', '2023-03-10 10:41:37'],
                   'CloseDate': ['2023-02-08 14:35:56', '2023-02-08 14:42:00', '2023-02-07 10:22:23', '2023-02-07 15:09:14', '2023-02-07 16:20:50', '2023-02-08 09:23:16', '2023-02-08 09:27:21', '2023-03-10 10:41:37', '2023-03-10 14:23:18']})

df[['StorageDate', 'CloseDate']] = df[['StorageDate', 'CloseDate']].apply(pd.to_datetime)
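A variant sketch, not from the answer: the per-group delta can also be computed with groupby.agg and written onto the last row of each group via the index returned by tail(1). This assumes the frame is sorted by CallID so the two orderings line up; the column names first_store/last_close are illustrative:

```python
import pandas as pd

# Variant sketch (assumes df is sorted by CallID): take the index of each
# group's last row, aggregate first/last timestamps per group, then write
# the differences back positionally.
df = pd.DataFrame({'CallID': [1, 1, 2],
                   'StorageDate': pd.to_datetime(['2023-02-08 14:35:09',
                                                  '2023-02-08 14:35:56',
                                                  '2023-02-07 10:17:18']),
                   'CloseDate': pd.to_datetime(['2023-02-08 14:35:56',
                                                '2023-02-08 14:42:00',
                                                '2023-02-07 10:22:23'])})
last_idx = df.groupby('CallID').tail(1).index
agg = df.groupby('CallID').agg(first_store=('StorageDate', 'first'),
                               last_close=('CloseDate', 'last'))
df.loc[last_idx, 'Time Delta'] = (agg['last_close'] - agg['first_store']).to_numpy()
print(df)
```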

Answer 2 (score: 1)

Use Series.duplicated to filter the last rows generated by GroupBy.transform:

m = ~df['CallID'].duplicated(keep='last')

g = df.groupby('CallID')

df.loc[m, 'Time Delta'] = (g['CloseDate'].transform('last')[m]
                                         .sub(g['StorageDate'].transform('first')[m]))
print (df)
   CallID         StorageDate           CloseDate       Time Delta
0       1 2023-02-08 14:35:09 2023-02-08 14:35:56              NaN
1       1 2023-02-08 14:35:56 2023-02-08 14:42:00  0 days 00:06:51
2       2 2023-02-07 10:17:18 2023-02-07 10:22:23              NaN
3       2 2023-02-07 10:22:23 2023-02-07 15:09:14              NaN
4       2 2023-02-07 15:09:14 2023-02-07 16:20:50              NaN
5       2 2023-02-07 16:20:49 2023-02-08 09:23:16              NaN
6       2 2023-02-08 09:23:16 2023-02-08 09:27:21  0 days 23:10:03
7       3 2023-03-10 10:31:25 2023-03-10 10:41:37              NaN
8       3 2023-03-10 10:41:37 2023-03-10 14:23:18  0 days 03:51:53

Another solution: aggregate with GroupBy.agg and map the differences back:

def f(x):
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return ('{:02d}:{:02d}:{:02d}').format(int(hours), int(minutes), int(seconds))

m = ~df['CallID'].duplicated(keep='last')

df1 = df.groupby('CallID').agg({'CloseDate':'last', 'StorageDate':'first'})

df.loc[m, 'Time Delta'] = (df.loc[m, 'CallID'].map(df1['CloseDate'].sub(df1['StorageDate']))
                             .apply(f))
print (df)
   CallID         StorageDate           CloseDate Time Delta
0       1 2023-02-08 14:35:09 2023-02-08 14:35:56        NaN
1       1 2023-02-08 14:35:56 2023-02-08 14:42:00   00:06:51
2       2 2023-02-07 10:17:18 2023-02-07 10:22:23        NaN
3       2 2023-02-07 10:22:23 2023-02-07 15:09:14        NaN
4       2 2023-02-07 15:09:14 2023-02-07 16:20:50        NaN
5       2 2023-02-07 16:20:49 2023-02-08 09:23:16        NaN
6       2 2023-02-08 09:23:16 2023-02-08 09:27:21   23:10:03
7       3 2023-03-10 10:31:25 2023-03-10 10:41:37        NaN
8       3 2023-03-10 10:41:37 2023-03-10 14:23:18   03:51:53
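A quick check of the f formatter defined above, on an illustrative value: because it works from total_seconds(), deltas longer than a day roll into the hours field instead of keeping a separate days component, which matches the 23:10:03-style output:

```python
import pandas as pd

def f(x):
    # Same formatter as in the answer: fold the whole delta into H:M:S.
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return '{:02d}:{:02d}:{:02d}'.format(int(hours), int(minutes), int(seconds))

print(f(pd.Timedelta('1 days 02:03:04')))  # 26:03:04
```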

Published 2023-03-15 21:01:44. Please keep this link when reposting: https://go.coder-hub.com/75745095.html