英文:
Identify the previous row values satisfying the condition in Pandas
问题
I need to identify the previous rows values which meet a condition. For the given row, the value from the previous row where the timestamp < 'timestamp_pre' get the maximum cum_id and its value.
for eg:
input:
ID timestamp com_id Value timestamp_Pre
aa 2023-1-5 06:33:27 2 33 2023-1-5 06:33:20
aa 2023-1-5 06:33:33 3 22 2023-1-5 06:33:28
aa 2023-1-5 06:33:39 4 44 2023-1-5 06:33:28
aa 2023-1-5 06:33:45 NA 11 2023-1-5 06:33:35
aa 2023-1-5 06:33:51 NA 99 NA
aa 2023-1-5 06:33:57 NA 88 2023-1-5 06:33:44
output:
ID timestamp com_id Value timestamp_Pre com_id_pre value_pre
aa 2023-1-5 06:33:27 2 33 2023-1-5 06:33:20 NA NA
aa 2023-1-5 06:33:33 3 22 2023-1-5 06:33:28 2 33
aa 2023-1-5 06:33:39 4 44 2023-1-5 06:33:28 2 33
aa 2023-1-5 06:33:45 NA 11 2023-1-5 06:33:35 3 22
aa 2023-1-5 06:33:51 NA 99 NA NA NA
aa 2023-1-5 06:33:57 NA 88 2023-1-5 06:33:44 4 44
Consider row number 4, its com_id_pre and value_pre should be from the previous row whose timestamp is < 2023-1-5 06:33:35
code used:
m=(df['timestamp']<df['timestamp_Pre']) & (df['com_id']>0)
df['com_id_pre']=df[m].groupby('id')['com_id'].transform('max')
df['com_id_pre']=df['com_id'].where(df['timestamp']<df['timestamp_Pre']).transform('max')
英文:
I need to identify the previous rows values which meet a condition. For the given row, the value from the previous row where the timestamp< 'timestamp_pre get the maximum cum_id and its value.
for eg:
input:
ID timestamp com_id Value timestamp_Pre
aa 2023-1-5 06:33:27 2 33 2023-1-5 06:33:20
aa 2023-1-5 06:33:33 3 22 2023-1-5 06:33:28
aa 2023-1-5 06:33:39 4 44 2023-1-5 06:33:28
aa 2023-1-5 06:33:45 NA 11 2023-1-5 06:33:35
aa 2023-1-5 06:33:51 NA 99 NA
aa 2023-1-5 06:33:57 NA 88 2023-1-5 06:33:44
output:
ID timestamp com_id Value timestamp_Pre com_id_pre value_pre
aa 2023-1-5 06:33:27 2 33 2023-1-5 06:33:20 NA NA
aa 2023-1-5 06:33:33 3 22 2023-1-5 06:33:28 2 33
aa 2023-1-5 06:33:39 4 44 2023-1-5 06:33:28 2 33
aa 2023-1-5 06:33:45 NA 11 2023-1-5 06:33:35 3 22
aa 2023-1-5 06:33:51 NA 99 NA NA NA
aa 2023-1-5 06:33:57 NA 88 2023-1-5 06:33:44 4 44
Consider row number 4, its com_id_pre and value_pre should be from the previous row whose timestamp is < 2023-1-5 06:33:35
code used:
m=(df['timestamp']<df['timestamp_Pre']) & (df['com_id']>0)
df['com_id_pre']=df[m].groupby('id)['com_id'].transform('max')
df['com_id_pre']=df['com_id'].where(df['timestamp']<df['timestamp_Pre']).transform('max')
答案1
得分: 0
使用 merge_asof
结合 DataFrame.dropna
进行左连接,并删除含有缺失值的行:
df = df.merge(pd.merge_asof(df.dropna(subset=['timestamp_Pre']), df,
left_on='timestamp_Pre',
right_on='timestamp',
allow_exact_matches=False,
by='ID',
suffixes=('', '_pre')).drop(['timestamp_Pre_pre','timestamp_pre'],axis=1),
how='left')
print (df)
ID timestamp com_id Value timestamp_Pre com_id_pre \
0 aa 2023-01-05 06:33:27 2.0 33 2023-01-05 06:33:20 NaN
1 aa 2023-01-05 06:33:33 3.0 22 2023-01-05 06:33:28 2.0
2 aa 2023-01-05 06:33:39 4.0 44 2023-01-05 06:33:28 2.0
3 aa 2023-01-05 06:33:45 NaN 11 2023-01-05 06:33:35 3.0
4 aa 2023-01-05 06:33:51 NaN 99 NaT NaN
5 aa 2023-01-05 06:33:57 NaN 88 2023-01-05 06:33:44 4.0
Value_pre
0 NaN
1 33.0
2 33.0
3 22.0
4 NaN
5 44.0
或者通过将缺失值替换为最小可能的 Timestamp
,处理后再替换缺失值:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['timestamp_Pre'] = pd.to_datetime(df['timestamp_Pre'])
df =( pd.merge_asof(df.fillna({'timestamp_Pre':pd.Timestamp.min})
.sort_values('timestamp_Pre').reset_index(),
df,
left_on='timestamp_Pre',
right_on='timestamp',
allow_exact_matches=False,
by='ID',
suffixes=('', '_pre'))
.sort_values('index', ignore_index=True)
.drop(['timestamp_Pre_pre','timestamp_pre','index'], axis=1)
.assign(timestamp_Pre = lambda x: x['timestamp_Pre']
.mask(x['timestamp_Pre'].dt.year.eq(1677))))
print (df)
ID timestamp com_id Value timestamp_Pre com_id_pre \
0 aa 2023-01-05 06:33:27 2.0 33 2023-01-05 06:33:20 NaN
1 aa 2023-01-05 06:33:33 3.0 22 2023-01-05 06:33:28 2.0
2 aa 2023-01-05 06:33:39 4.0 44 2023-01-05 06:33:28 2.0
3 aa 2023-01-05 06:33:45 NaN 11 2023-01-05 06:33:35 3.0
4 aa 2023-01-05 06:33:51 NaN 99 NaT NaN
5 aa 2023-01-05 06:33:57 NaN 88 2023-01-05 06:33:44 4.0
Value_pre
0 NaN
1 33.0
2 33.0
3 22.0
4 NaN
5 44.0
英文:
Use merge_asof
with remove rows with missing values by DataFrame.dropna
with left join:
df = df.merge(pd.merge_asof(df.dropna(subset=['timestamp_Pre']), df,
left_on='timestamp_Pre',
right_on='timestamp',
allow_exact_matches=False,
by='ID',
suffixes=('','_pre')).drop(['timestamp_Pre_pre','timestamp_pre'],axis=1),
how='left')
print (df)
ID timestamp com_id Value timestamp_Pre com_id_pre \
0 aa 2023-01-05 06:33:27 2.0 33 2023-01-05 06:33:20 NaN
1 aa 2023-01-05 06:33:33 3.0 22 2023-01-05 06:33:28 2.0
2 aa 2023-01-05 06:33:39 4.0 44 2023-01-05 06:33:28 2.0
3 aa 2023-01-05 06:33:45 NaN 11 2023-01-05 06:33:35 3.0
4 aa 2023-01-05 06:33:51 NaN 99 NaT NaN
5 aa 2023-01-05 06:33:57 NaN 88 2023-01-05 06:33:44 4.0
Value_pre
0 NaN
1 33.0
2 33.0
3 22.0
4 NaN
5 44.0
Or repalce missing values by smallest possible Timestamp
, processing and last replace to missing values:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['timestamp_Pre'] = pd.to_datetime(df['timestamp_Pre'])
df =( pd.merge_asof(df.fillna({'timestamp_Pre':pd.Timestamp.min})
.sort_values('timestamp_Pre').reset_index(),
df,
left_on='timestamp_Pre',
right_on='timestamp',
allow_exact_matches=False,
by='ID',
suffixes=('','_pre'))
.sort_values('index', ignore_index=True)
.drop(['timestamp_Pre_pre','timestamp_pre','index'], axis=1)
.assign(timestamp_Pre = lambda x: x['timestamp_Pre']
.mask(x['timestamp_Pre'].dt.year.eq(1677))))
print (df)
ID timestamp com_id Value timestamp_Pre com_id_pre \
0 aa 2023-01-05 06:33:27 2.0 33 2023-01-05 06:33:20 NaN
1 aa 2023-01-05 06:33:33 3.0 22 2023-01-05 06:33:28 2.0
2 aa 2023-01-05 06:33:39 4.0 44 2023-01-05 06:33:28 2.0
3 aa 2023-01-05 06:33:45 NaN 11 2023-01-05 06:33:35 3.0
4 aa 2023-01-05 06:33:51 NaN 99 NaT NaN
5 aa 2023-01-05 06:33:57 NaN 88 2023-01-05 06:33:44 4.0
Value_pre
0 NaN
1 33.0
2 33.0
3 22.0
4 NaN
5 44.0
答案2
得分: 0
你可以使用[tag:numpy]来完成这个任务:
# 按照所需的优先级顺序对值进行排序(顶部=较高)
# 这可以是基于多列的排序,这里我们将较高的com_id放在顶部
df2 = df.sort_values(by='com_id', ascending=False)
# 确保日期时间数据类型
df2[['timestamp', 'timestamp_Pre']] = df2[['timestamp', 'timestamp_Pre']].apply(pd.to_datetime)
# 确定timestamp_Pre大于timestamp的情况
m = (df2['timestamp_Pre'].to_numpy()[:, None] > df2['timestamp'].to_numpy())
# 获取每个timestamp_Pre的第一个有效索引
m2 = m.any(axis=1)
idx = df2.index[m.argmax(axis=1)][m2]
# 分配为新列
out = df.join(df2.loc[idx, ['com_id', 'Value']]
.set_axis(df2.index[m2]).add_suffix('_pre')
)
输出结果:
ID timestamp com_id Value timestamp_Pre com_id_pre Value_pre
0 aa 2023-1-5 06:33:27 2.0 33 2023-1-5 06:33:20 NaN NaN
1 aa 2023-1-5 06:33:33 3.0 22 2023-1-5 06:33:28 2.0 33.0
2 aa 2023-1-5 06:33:39 4.0 44 2023-1-5 06:33:28 2.0 33.0
3 aa 2023-1-5 06:33:45 NaN 11 2023-1-5 06:33:35 3.0 22.0
4 aa 2023-1-5 06:33:51 NaN 99 NaN NaN NaN
5 aa 2023-1-5 06:33:57 NaN 88 2023-1-5 06:33:44 4.0 44.0
英文:
You can use [tag:numpy] for this:
# sort the values in the desired order of priority (top = higher)
# this could be on several columns, here we put higher com_id on top
df2 = df.sort_values(by='com_id', ascending=False)
# ensure datetime type
df2[['timestamp', 'timestamp_Pre']] = df2[['timestamp', 'timestamp_Pre']
].apply(pd.to_datetime)
# identify timestamp_Pre that are greater than timestamp
m = (df2['timestamp_Pre'].to_numpy()[:,None] > df2['timestamp'].to_numpy())
# get first valid index per timestamp_Pre
m2 = m.any(axis=1)
idx = df2.index[m.argmax(axis=1)][m2]
# assign as new columns
out = df.join(df2.loc[idx, ['com_id', 'Value']]
.set_axis(df2.index[m2]).add_suffix('_pre')
)
Output:
ID timestamp com_id Value timestamp_Pre com_id_pre Value_pre
0 aa 2023-1-5 06:33:27 2.0 33 2023-1-5 06:33:20 NaN NaN
1 aa 2023-1-5 06:33:33 3.0 22 2023-1-5 06:33:28 2.0 33.0
2 aa 2023-1-5 06:33:39 4.0 44 2023-1-5 06:33:28 2.0 33.0
3 aa 2023-1-5 06:33:45 NaN 11 2023-1-5 06:33:35 3.0 22.0
4 aa 2023-1-5 06:33:51 NaN 99 NaN NaN NaN
5 aa 2023-1-5 06:33:57 NaN 88 2023-1-5 06:33:44 4.0 44.0
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论