March 4, 2023, 03:00:27

Compute a combined difference of two columns and a running difference in a column

Question

For rows that share an ID, Diff should be the row's End_Date minus the previous row's End_Date, except that the first row and the last duplicate row of that ID use End_Date minus Start_Date. For IDs that occur only once, Diff is simply End_Date minus Start_Date.
My data set looks like the following:

df = 
Index  ID	Start_Date	End_Date
0	 118645	2021-01-04	2021-04-28
1	 118985	2021-01-11	2022-01-24
2	 119023	2021-01-07	2021-09-08
3	 119225	2021-01-08	2021-04-11
4	 119225	2021-01-08	2021-04-11
5	 119276	2021-01-07	2021-03-16
6	 119863	2021-01-11	2021-03-25
7	 119924	2021-01-13	2021-09-06
8	 119924	2021-01-13	2021-11-09
9	 119924	2021-01-13	2022-05-23
10	 119924	2021-01-13	2022-11-10
11	 119987	2021-01-12	2021-02-23

My solution for this problem is as follows:

import numpy as np
import pandas as pd
# Repeated ID: End_Date minus the previous row's End_Date, in days.
df['Diff'] = np.where(df.ID == df.ID.shift(), (pd.to_datetime(df["End_Date"]) - pd.to_datetime(df["End_Date"]).shift()) // np.timedelta64(1, 'D'), None)
# First row of an ID: End_Date minus Start_Date, in days.
df['Diff'] = np.where(df.ID != df.ID.shift(), (pd.to_datetime(df["End_Date"]) - pd.to_datetime(df["Start_Date"])) // np.timedelta64(1, 'D'), df['Diff'])
# Last row of each ID: overwrite with End_Date minus Start_Date (copy to avoid writing into a slice of df).
df_unique = df.drop_duplicates(subset="ID", keep="last").copy()
df_unique['Diff'] = (pd.to_datetime(df_unique['End_Date']) - pd.to_datetime(df_unique['Start_Date'])).dt.days
df_final = df_unique.combine_first(df)
df_final = 
Index  ID	Start_Date	End_Date    Diff
0	 118645	2021-01-04	2021-04-28	114
1	 118985	2021-01-11	2022-01-24	378
2	 119023	2021-01-07	2021-09-08	244
3	 119225	2021-01-08	2021-04-11	93
4	 119225	2021-01-08	2021-04-11	93
5	 119276	2021-01-07	2021-03-16	68
6	 119863	2021-01-11	2021-03-25	73
7	 119924	2021-01-13	2021-09-06	236
8	 119924	2021-01-13	2021-11-09	64
9	 119924	2021-01-13	2022-05-23	195
10	 119924	2021-01-13	2022-11-10	666
11	 119987	2021-01-12	2021-02-23	42

Is there a better way to solve this problem? Thanks for your contributions.

Answer 1

Score: 1

import pandas as pd

df = pd.DataFrame({'ID':[118645, 118985, 119023, 119225, 119225, 119276, 119863, 
                         119924, 119924, 119924, 119924, 119987],
                   'Start_Date':['2021-01-04', '2021-01-11', '2021-01-07', '2021-01-08', 
                                 '2021-01-08', '2021-01-07', '2021-01-11', '2021-01-13',
                                 '2021-01-13', '2021-01-13', '2021-01-13', '2021-01-12'],
                   'End_Date':['2021-04-28', '2022-01-24', '2021-09-08', '2021-04-11', 
                                 '2021-04-11', '2021-03-16', '2021-03-25', '2021-09-06',
                                 '2021-11-09', '2022-05-23', '2022-11-10', '2021-02-23']
                   })

def diff(g):
    # Parse the date strings of one ID group.
    start = pd.to_datetime(g['Start_Date'])
    end = pd.to_datetime(g['End_Date'])
    # Default for every row: End_Date minus Start_Date, in days.
    d = (end - start).dt.days
    if len(g) > 2:
        # Interior rows of a repeated ID: days since the previous End_Date.
        d.iloc[1:-1] = end.diff().dt.days.iloc[1:-1].astype(int).to_numpy()
    return d

# Select only the date columns so the function never touches the grouping key;
# group_keys=False keeps the original row index on the result.
df['diff'] = (df.groupby('ID', group_keys=False)[['Start_Date', 'End_Date']]
                .apply(diff)
              )

print(df)

        ID  Start_Date    End_Date  diff
0   118645  2021-01-04  2021-04-28   114
1   118985  2021-01-11  2022-01-24   378
2   119023  2021-01-07  2021-09-08   244
3   119225  2021-01-08  2021-04-11    93
4   119225  2021-01-08  2021-04-11    93
5   119276  2021-01-07  2021-03-16    68
6   119863  2021-01-11  2021-03-25    73
7   119924  2021-01-13  2021-09-06   236
8   119924  2021-01-13  2021-11-09    64
9   119924  2021-01-13  2022-05-23   195
10  119924  2021-01-13  2022-11-10   666
11  119987  2021-01-12  2021-02-23    42
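
For comparison, the same rule can probably be written without a per-group Python function. The following is only a sketch, assuming the rows of each ID are already in chronological order; the helper names (`is_first`, `is_last`, `interior`, `span`, `since_prev_end`) are made up for illustration and do not come from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [118645, 118985, 119023, 119225, 119225, 119276, 119863,
           119924, 119924, 119924, 119924, 119987],
    "Start_Date": ["2021-01-04", "2021-01-11", "2021-01-07", "2021-01-08",
                   "2021-01-08", "2021-01-07", "2021-01-11", "2021-01-13",
                   "2021-01-13", "2021-01-13", "2021-01-13", "2021-01-12"],
    "End_Date": ["2021-04-28", "2022-01-24", "2021-09-08", "2021-04-11",
                 "2021-04-11", "2021-03-16", "2021-03-25", "2021-09-06",
                 "2021-11-09", "2022-05-23", "2022-11-10", "2021-02-23"],
})

# Parse the date strings once.
start = pd.to_datetime(df["Start_Date"])
end = pd.to_datetime(df["End_Date"])

# Interior rows are neither the first nor the last occurrence of their ID.
is_first = ~df["ID"].duplicated(keep="first")
is_last = ~df["ID"].duplicated(keep="last")
interior = ~(is_first | is_last)

# Days since the previous End_Date within the same ID (NaN on first rows).
since_prev_end = (end - end.groupby(df["ID"]).shift()).dt.days

# Default is End_Date - Start_Date; interior rows take the running difference.
span = (end - start).dt.days
df["Diff"] = span.mask(interior, since_prev_end).astype(int)

print(df)
```

Only the rows strictly between the first and last occurrence of an ID are overwritten, so IDs that appear once or twice keep End_Date minus Start_Date on every row, as in the expected output in the question.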
