识别 Pandas 中满足条件的前一行数值。

huangapple go评论63阅读模式
英文:

Identify the previous row values satisfying the condition in Pandas

问题

I need to identify the previous rows values which meet a condition. For the given row, the value from the previous row where the timestamp < 'timestamp_pre' get the maximum cum_id and its value.

for eg:
input:

ID          timestamp  com_id	Value	timestamp_Pre
aa  2023-1-5 06:33:27      2    33	2023-1-5 06:33:20
aa  2023-1-5 06:33:33      3	22	2023-1-5 06:33:28
aa  2023-1-5 06:33:39      4	44	2023-1-5 06:33:28
aa  2023-1-5 06:33:45      NA	11	2023-1-5 06:33:35
aa  2023-1-5 06:33:51      NA	99	NA
aa  2023-1-5 06:33:57      NA	88	2023-1-5 06:33:44

output:

ID          timestamp  com_id	Value	timestamp_Pre	com_id_pre	value_pre
aa  2023-1-5 06:33:27      2    33	2023-1-5 06:33:20 	   NA	NA
aa  2023-1-5 06:33:33      3	22	2023-1-5 06:33:28		2	33
aa  2023-1-5 06:33:39      4	44	2023-1-5 06:33:28		2	33
aa  2023-1-5 06:33:45      NA	11	2023-1-5 06:33:35		3	22		
aa  2023-1-5 06:33:51      NA	99	NA				        NA	NA
aa  2023-1-5 06:33:57      NA	88	2023-1-5 06:33:44		4	44

Consider row number 4, its com_id_pre and value_pre should be from the previous row whose timestamp is < 2023-1-5 06:33:35

code used:

m=(df['timestamp']<df['timestamp_Pre']) & (df['com_id']>0)
df['com_id_pre']=df[m].groupby('id')['com_id'].transform('max')

df['com_id_pre']=df['com_id'].where(df['timestamp']<df['timestamp_Pre']).transform('max')
英文:

I need to identify the previous rows values which meet a condition. For the given row, the value from the previous row where the timestamp< 'timestamp_pre get the maximum cum_id and its value.

for eg:
input:

ID          timestamp  com_id	Value	timestamp_Pre
aa  2023-1-5 06:33:27      2    33	2023-1-5 06:33:20
aa  2023-1-5 06:33:33      3	22	2023-1-5 06:33:28
aa  2023-1-5 06:33:39      4	44	2023-1-5 06:33:28
aa  2023-1-5 06:33:45      NA	11	2023-1-5 06:33:35
aa  2023-1-5 06:33:51      NA	99	NA
aa  2023-1-5 06:33:57      NA	88	2023-1-5 06:33:44

output:

ID          timestamp  com_id	Value	timestamp_Pre	com_id_pre	value_pre
aa  2023-1-5 06:33:27      2    33	2023-1-5 06:33:20 	   NA	NA
aa  2023-1-5 06:33:33      3	22	2023-1-5 06:33:28		2	33
aa  2023-1-5 06:33:39      4	44	2023-1-5 06:33:28		2	33
aa  2023-1-5 06:33:45      NA	11	2023-1-5 06:33:35		3	22		
aa  2023-1-5 06:33:51      NA	99	NA				        NA	NA
aa  2023-1-5 06:33:57      NA	88	2023-1-5 06:33:44		4	44

Consider row number 4, its com_id_pre and value_pre should be from the previous row whose timestamp is < 2023-1-5 06:33:35

code used:

m=(df[&#39;timestamp&#39;]&lt;df[&#39;timestamp_Pre&#39;]) &amp; (df[&#39;com_id&#39;]&gt;0)
df[&#39;com_id_pre&#39;]=df[m].groupby(&#39;id)[&#39;com_id&#39;].transform(&#39;max&#39;)

df[&#39;com_id_pre&#39;]=df[&#39;com_id&#39;].where(df[&#39;timestamp&#39;]&lt;df[&#39;timestamp_Pre&#39;]).transform(&#39;max&#39;)

答案1

得分: 0

使用 merge_asof 结合 DataFrame.dropna 进行左连接,并删除含有缺失值的行:

df = df.merge(pd.merge_asof(df.dropna(subset=['timestamp_Pre']), df, 
                   left_on='timestamp_Pre', 
                   right_on='timestamp',
                   allow_exact_matches=False,
                   by='ID',
                   suffixes=('', '_pre')).drop(['timestamp_Pre_pre','timestamp_pre'],axis=1),
              how='left')
print (df)
   ID           timestamp  com_id  Value       timestamp_Pre  com_id_pre  \
0  aa 2023-01-05 06:33:27     2.0     33 2023-01-05 06:33:20         NaN   
1  aa 2023-01-05 06:33:33     3.0     22 2023-01-05 06:33:28         2.0   
2  aa 2023-01-05 06:33:39     4.0     44 2023-01-05 06:33:28         2.0   
3  aa 2023-01-05 06:33:45     NaN     11 2023-01-05 06:33:35         3.0   
4  aa 2023-01-05 06:33:51     NaN     99                 NaT         NaN   
5  aa 2023-01-05 06:33:57     NaN     88 2023-01-05 06:33:44         4.0   

   Value_pre  
0        NaN  
1       33.0  
2       33.0  
3       22.0  
4        NaN  
5       44.0  

或者通过将缺失值替换为最小可能的 Timestamp,处理后再替换缺失值:

df['timestamp'] = pd.to_datetime(df['timestamp'])
df['timestamp_Pre'] = pd.to_datetime(df['timestamp_Pre'])

df =( pd.merge_asof(df.fillna({'timestamp_Pre':pd.Timestamp.min})
                      .sort_values('timestamp_Pre').reset_index(), 
                   df, 
                   left_on='timestamp_Pre', 
                   right_on='timestamp',
                   allow_exact_matches=False,
                   by='ID',
                   suffixes=('', '_pre'))
         .sort_values('index', ignore_index=True)
         .drop(['timestamp_Pre_pre','timestamp_pre','index'], axis=1)
         .assign(timestamp_Pre = lambda x: x['timestamp_Pre']
                                 .mask(x['timestamp_Pre'].dt.year.eq(1677))))
print (df)
   ID           timestamp  com_id  Value       timestamp_Pre  com_id_pre  \
0  aa 2023-01-05 06:33:27     2.0     33 2023-01-05 06:33:20         NaN   
1  aa 2023-01-05 06:33:33     3.0     22 2023-01-05 06:33:28         2.0   
2  aa 2023-01-05 06:33:39     4.0     44 2023-01-05 06:33:28         2.0   
3  aa 2023-01-05 06:33:45     NaN     11 2023-01-05 06:33:35         3.0   
4  aa 2023-01-05 06:33:51     NaN     99                 NaT         NaN   
5  aa 2023-01-05 06:33:57     NaN     88 2023-01-05 06:33:44         4.0   

   Value_pre  
0        NaN  
1       33.0  
2       33.0  
3       22.0  
4        NaN  
5       44.0  
英文:

Use merge_asof with remove rows with missing values by DataFrame.dropna with left join:

df = df.merge(pd.merge_asof(df.dropna(subset=[&#39;timestamp_Pre&#39;]), df, 
left_on=&#39;timestamp_Pre&#39;, 
right_on=&#39;timestamp&#39;,
allow_exact_matches=False,
by=&#39;ID&#39;,
suffixes=(&#39;&#39;,&#39;_pre&#39;)).drop([&#39;timestamp_Pre_pre&#39;,&#39;timestamp_pre&#39;],axis=1),
how=&#39;left&#39;)
print (df)
ID           timestamp  com_id  Value       timestamp_Pre  com_id_pre  \
0  aa 2023-01-05 06:33:27     2.0     33 2023-01-05 06:33:20         NaN   
1  aa 2023-01-05 06:33:33     3.0     22 2023-01-05 06:33:28         2.0   
2  aa 2023-01-05 06:33:39     4.0     44 2023-01-05 06:33:28         2.0   
3  aa 2023-01-05 06:33:45     NaN     11 2023-01-05 06:33:35         3.0   
4  aa 2023-01-05 06:33:51     NaN     99                 NaT         NaN   
5  aa 2023-01-05 06:33:57     NaN     88 2023-01-05 06:33:44         4.0   
Value_pre  
0        NaN  
1       33.0  
2       33.0  
3       22.0  
4        NaN  
5       44.0  

Or repalce missing values by smallest possible Timestamp, processing and last replace to missing values:

df[&#39;timestamp&#39;] = pd.to_datetime(df[&#39;timestamp&#39;])
df[&#39;timestamp_Pre&#39;] = pd.to_datetime(df[&#39;timestamp_Pre&#39;])
df =( pd.merge_asof(df.fillna({&#39;timestamp_Pre&#39;:pd.Timestamp.min})
.sort_values(&#39;timestamp_Pre&#39;).reset_index(), 
df, 
left_on=&#39;timestamp_Pre&#39;, 
right_on=&#39;timestamp&#39;,
allow_exact_matches=False,
by=&#39;ID&#39;,
suffixes=(&#39;&#39;,&#39;_pre&#39;))
.sort_values(&#39;index&#39;, ignore_index=True)
.drop([&#39;timestamp_Pre_pre&#39;,&#39;timestamp_pre&#39;,&#39;index&#39;], axis=1)
.assign(timestamp_Pre = lambda x: x[&#39;timestamp_Pre&#39;]
.mask(x[&#39;timestamp_Pre&#39;].dt.year.eq(1677))))
print (df)
ID           timestamp  com_id  Value       timestamp_Pre  com_id_pre  \
0  aa 2023-01-05 06:33:27     2.0     33 2023-01-05 06:33:20         NaN   
1  aa 2023-01-05 06:33:33     3.0     22 2023-01-05 06:33:28         2.0   
2  aa 2023-01-05 06:33:39     4.0     44 2023-01-05 06:33:28         2.0   
3  aa 2023-01-05 06:33:45     NaN     11 2023-01-05 06:33:35         3.0   
4  aa 2023-01-05 06:33:51     NaN     99                 NaT         NaN   
5  aa 2023-01-05 06:33:57     NaN     88 2023-01-05 06:33:44         4.0   
Value_pre  
0        NaN  
1       33.0  
2       33.0  
3       22.0  
4        NaN  
5       44.0 

答案2

得分: 0

你可以使用[tag:numpy]来完成这个任务:

# 按照所需的优先级顺序对值进行排序(顶部=较高)
# 这可以是基于多列的排序,这里我们将较高的com_id放在顶部
df2 = df.sort_values(by='com_id', ascending=False)

# 确保日期时间数据类型
df2[['timestamp', 'timestamp_Pre']] = df2[['timestamp', 'timestamp_Pre']].apply(pd.to_datetime)

# 确定timestamp_Pre大于timestamp的情况
m = (df2['timestamp_Pre'].to_numpy()[:, None] > df2['timestamp'].to_numpy())

# 获取每个timestamp_Pre的第一个有效索引
m2 = m.any(axis=1)
idx = df2.index[m.argmax(axis=1)][m2]

# 分配为新列
out = df.join(df2.loc[idx, ['com_id', 'Value']]
                 .set_axis(df2.index[m2]).add_suffix('_pre')
              )

输出结果:

   ID          timestamp  com_id  Value      timestamp_Pre  com_id_pre  Value_pre
0  aa  2023-1-5 06:33:27     2.0     33  2023-1-5 06:33:20         NaN        NaN
1  aa  2023-1-5 06:33:33     3.0     22  2023-1-5 06:33:28         2.0       33.0
2  aa  2023-1-5 06:33:39     4.0     44  2023-1-5 06:33:28         2.0       33.0
3  aa  2023-1-5 06:33:45     NaN     11  2023-1-5 06:33:35         3.0       22.0
4  aa  2023-1-5 06:33:51     NaN     99                NaN         NaN        NaN
5  aa  2023-1-5 06:33:57     NaN     88  2023-1-5 06:33:44         4.0       44.0
英文:

You can use [tag:numpy] for this:

# sort the values in the desired order of priority (top = higher)
# this could be on several columns, here we put higher com_id on top
df2 = df.sort_values(by=&#39;com_id&#39;, ascending=False)

# ensure datetime type
df2[[&#39;timestamp&#39;, &#39;timestamp_Pre&#39;]] = df2[[&#39;timestamp&#39;, &#39;timestamp_Pre&#39;]
                                          ].apply(pd.to_datetime)

# identify timestamp_Pre that are greater than timestamp
m = (df2[&#39;timestamp_Pre&#39;].to_numpy()[:,None] &gt; df2[&#39;timestamp&#39;].to_numpy())

# get first valid index per timestamp_Pre
m2 = m.any(axis=1)
idx = df2.index[m.argmax(axis=1)][m2]

# assign as new columns
out = df.join(df2.loc[idx, [&#39;com_id&#39;, &#39;Value&#39;]]
                 .set_axis(df2.index[m2]).add_suffix(&#39;_pre&#39;)
              )

Output:

   ID          timestamp  com_id  Value      timestamp_Pre  com_id_pre  Value_pre
0  aa  2023-1-5 06:33:27     2.0     33  2023-1-5 06:33:20         NaN        NaN
1  aa  2023-1-5 06:33:33     3.0     22  2023-1-5 06:33:28         2.0       33.0
2  aa  2023-1-5 06:33:39     4.0     44  2023-1-5 06:33:28         2.0       33.0
3  aa  2023-1-5 06:33:45     NaN     11  2023-1-5 06:33:35         3.0       22.0
4  aa  2023-1-5 06:33:51     NaN     99                NaN         NaN        NaN
5  aa  2023-1-5 06:33:57     NaN     88  2023-1-5 06:33:44         4.0       44.0

huangapple
  • 本文由 发表于 2023年3月8日 16:00:20
  • 转载请务必保留本文链接:https://go.coder-hub.com/75670546.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定