英文:
Compute a combined difference of two columns and a running difference in a column
问题
如果存在重复的ID,则Diff是下一个End_Date减去前一个End_Date,而对于最后一个重复的ID,Diff也是End_Date减去Start_Date。否则,Diff也是End_Date减去Start_Date。
我的数据集如下所示:
df =
Index ID Start_Date End_Date
0 118645 2021-01-04 2021-04-28
1 118985 2021-01-11 2022-01-24
2 119023 2021-01-07 2021-09-08
3 119225 2021-01-08 2021-04-11
4 119225 2021-01-08 2021-04-11
5 119276 2021-01-07 2021-03-16
6 119863 2021-01-11 2021-03-25
7 119924 2021-01-13 2021-09-06
8 119924 2021-01-13 2021-11-09
9 119924 2021-01-13 2022-05-23
10 119924 2021-01-13 2022-11-10
11 119987 2021-01-12 2021-02-23
我对这个问题的解决方法如下:
df['Diff'] = np.where(df.ID == df.ID.shift(), (pd.to_datetime(df["End_Date"]) - pd.to_datetime(df["End_Date"]).shift()) // np.timedelta64(1, 'D'), None)
df['Diff'] = np.where(df.ID != df.ID.shift(), (pd.to_datetime(df["End_Date"]) - pd.to_datetime(df["Start_Date"])) // np.timedelta64(1, 'D'), df['Diff'])
df_unique = df.drop duplicates(subset="ID", keep="last")
df_unique['Diff'] = df_unique['End_Date'].sub(df_unique['Start_Date'], axis=0)
df_final = df_unique.combine_first(df)
df_final =
Index ID Start_Date End_Date Diff
0 118645 2021-01-04 2021-04-28 114
1 118985 2021-01-11 2022-01-24 378
2 119023 2021-01-07 2021-09-08 244
3 119225 2021-01-08 2021-04-11 93
4 119225 2021-01-08 2021-04-11 93
5 119276 2021-01-07 2021-03-16 68
6 119863 2021-01-11 2021-03-25 73
7 119924 2021-01-13 2021-09-06 236
8 119924 2021-01-13 2021-11-09 64
9 119924 2021-01-13 2022-05-23 195
10 119924 2021-01-13 2022-11-10 666
11 119987 2021-01-12 2021-02-23 42
是否有更好的解决方法?感谢您的贡献
英文:
If there are duplicate IDs, Diff is the next End_Date minus the previous End_Date and Diff is End_Date minus Start_Date for the last duplicate ID, otherwise Diff is also End_Date minus Start_Date.
My data set looks like the following:
df =
Index ID Start_Date End_Date
0 118645 2021-01-04 2021-04-28
1 118985 2021-01-11 2022-01-24
2 119023 2021-01-07 2021-09-08
3 119225 2021-01-08 2021-04-11
4 119225 2021-01-08 2021-04-11
5 119276 2021-01-07 2021-03-16
6 119863 2021-01-11 2021-03-25
7 119924 2021-01-13 2021-09-06
8 119924 2021-01-13 2021-11-09
9 119924 2021-01-13 2022-05-23
10 119924 2021-01-13 2022-11-10
11 119987 2021-01-12 2021-02-23
My solution for this problem is as follows:
df['Diff'] = np.where(df.ID == df.ID.shift(), (pd.to_datetime(df["End_Date"]) - pd.to_datetime(df["End_Date"]).shift()) // np.timedelta64(1, 'D'), None)
df['Diff'] = np.where(df.ID != df.ID.shift(), (pd.to_datetime(df["End_Date"]) - pd.to_datetime(df["Start_Date"])) // np.timedelta64(1, 'D'), df['Diff'])
df_unique = df.drop_duplicates(subset="ID", keep="last")
df_unique['Diff'] = df_unique['End_Date'].sub(df_unique['Start_Date'], axis=0)
df_final = df_unique.combine_first(df)
df_final =
Index ID Start_Date End_Date Diff
0 118645 2021-01-04 2021-04-28 114
1 118985 2021-01-11 2022-01-24 378
2 119023 2021-01-07 2021-09-08 244
3 119225 2021-01-08 2021-04-11 93
4 119225 2021-01-08 2021-04-11 93
5 119276 2021-01-07 2021-03-16 68
6 119863 2021-01-11 2021-03-25 73
7 119924 2021-01-13 2021-09-06 236
8 119924 2021-01-13 2021-11-09 64
9 119924 2021-01-13 2022-05-23 195
10 119924 2021-01-13 2022-11-10 666
11 119987 2021-01-12 2021-02-23 42
Is there any better way to solve this problem? Thanks for your contributions
答案1
得分: 1
以下是您提供的代码的翻译部分:
import pandas as pd
df = pd.DataFrame({'ID':[118645, 118985, 119023, 119225, 119225, 119276, 119863,
119924, 119924, 119924, 119924, 119987],
'Start_Date':['2021-01-04', '2021-01-11', '2021-01-07', '2021-01-08',
'2021-01-08', '2021-01-07', '2021-01-11', '2021-01-13',
'2021-01-13', '2021-01-13', '2021-01-13', '2021-01-12'],
'End_Date':['2021-04-28', '2022-01-24', '2021-09-08', '2021-04-11',
'2021-04-11', '2021-03-16', '2021-03-25', '2021-09-06',
'2021-11-09', '2022-05-23', '2022-11-10', '2021-02-23']
})
def diff(g):
g['diff'] = (pd.to_datetime(g['End_Date'], infer_datetime_format=True)
- pd.to_datetime(g['Start_Date'], infer_datetime_format=True)
).dt.days
if len(g) > 1:
g['diff'][1:-1] = (g['diff'][:-1].diff()[1:]).astype(int)
return g
r = (df.groupby('ID')
.apply(lambda g: diff(g))
)
print(r)
希望这对您有所帮助。如果您有其他问题或需要进一步的翻译,请随时告诉我。
英文:
import pandas as pd
df = pd.DataFrame({'ID':[118645, 118985, 119023, 119225, 119225, 119276, 119863,
119924, 119924, 119924, 119924, 119987],
'Start_Date':['2021-01-04', '2021-01-11', '2021-01-07', '2021-01-08',
'2021-01-08', '2021-01-07', '2021-01-11', '2021-01-13',
'2021-01-13', '2021-01-13', '2021-01-13', '2021-01-12'],
'End_Date':['2021-04-28', '2022-01-24', '2021-09-08', '2021-04-11',
'2021-04-11', '2021-03-16', '2021-03-25', '2021-09-06',
'2021-11-09', '2022-05-23', '2022-11-10', '2021-02-23']
})
def diff(g):
g['diff'] = (pd.to_datetime(g['End_Date'], infer_datetime_format=True)
- pd.to_datetime(g['Start_Date'], infer_datetime_format=True)
).dt.days
if len(g) > 1:
g['diff'][1:-1] = ( g['diff'][:-1].diff()[1:] ).astype(int)
return g
r = (df.groupby('ID')
.apply(lambda g: diff(g))
)
print(r)
ID Start_Date End_Date diff
0 118645 2021-01-04 2021-04-28 114
1 118985 2021-01-11 2022-01-24 378
2 119023 2021-01-07 2021-09-08 244
3 119225 2021-01-08 2021-04-11 93
4 119225 2021-01-08 2021-04-11 93
5 119276 2021-01-07 2021-03-16 68
6 119863 2021-01-11 2021-03-25 73
7 119924 2021-01-13 2021-09-06 236
8 119924 2021-01-13 2021-11-09 64
9 119924 2021-01-13 2022-05-23 195
10 119924 2021-01-13 2022-11-10 666
11 119987 2021-01-12 2021-02-23 42
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论