How to assign new column value based on max value in another column preceding date
Question
I would like to create a new column called CDAT in the following dataframe. For each "PREG" event, CDAT should equal the DATE of the last "BRED" event with the same ID, LACT and FDAT combination that preceded the "PREG" event.
Effectively, I need to group by ID, LACT and FDAT and then, wherever there is a "PREG" event, fill the new CDAT column with the DATE of the most recent "BRED" event that precedes the date of that "PREG" event.
An example of the data is presented below:
          ID  LACT        FDAT  EVENT        DATE
    0     46     1  2011-09-23   BRED  2012-03-02
    1     46     1  2011-09-23   PREG  2012-04-03
    2     46     1  2011-09-23   PREG  2012-05-22
    3     46     1  2011-09-23   PREG  2012-10-09
    4     46     2  2012-11-15  FRESH  2012-11-15
    5     46     2  2012-11-15    LUT  2013-01-08
    6     46     2  2012-11-15     OS  2013-01-15
    7     46     2  2012-11-15   BRED  2013-01-01
    8     46     2  2012-11-15   BRED  2013-01-24
    9     46     2  2012-11-15   PREG  2013-02-26
    10    46     2  2012-11-16   BRED  2013-03-10
The output I would like to achieve is:
          ID  LACT        FDAT  EVENT        DATE        CDAT
    0     46     1  2011-09-23   BRED  2012-03-02
    1     46     1  2011-09-23   PREG  2012-04-03  2012-03-02
    2     46     1  2011-09-23   PREG  2012-05-22  2012-03-02
    3     46     1  2011-09-23   PREG  2012-10-09  2012-03-02
    4     46     2  2012-11-15  FRESH  2012-11-15
    5     46     2  2012-11-15    LUT  2013-01-08
    6     46     2  2012-11-15     OS  2013-01-15
    7     46     2  2012-11-15   BRED  2013-01-01
    8     46     2  2012-11-15   BRED  2013-01-24
    9     46     2  2012-11-15   PREG  2013-02-26  2013-01-24
    10    46     2  2012-11-16   BRED  2013-03-10
I cannot think of a way to incorporate the date and EVENT selection into a groupby statement that would achieve what I would like to do.
A list of the sample data is presented below
[[46,1,Timestamp('2011-09-23 00:00:00'),'BRED',Timestamp('2012-03-02 00:00:00')],
[46,1,Timestamp('2011-09-23 00:00:00'),'PREG',Timestamp('2012-04-03 00:00:00')],
[46,1,Timestamp('2011-09-23 00:00:00'),'PREG',Timestamp('2012-05-22 00:00:00')],
[46,1,Timestamp('2011-09-23 00:00:00'),'PREG',Timestamp('2012-10-09 00:00:00')],
[46,2,Timestamp('2012-11-15 00:00:00'),'FRESH',Timestamp('2012-11-15 00:00:00')],
[46,2,Timestamp('2012-11-15 00:00:00'),'LUT',Timestamp('2013-01-08 00:00:00')],
[46,2,Timestamp('2012-11-15 00:00:00'),'OS',Timestamp('2013-01-15 00:00:00')],
[46,2,Timestamp('2012-11-15 00:00:00'),'BRED',Timestamp('2013-01-01 00:00:00')],
[46,2,Timestamp('2012-11-15 00:00:00'),'BRED',Timestamp('2013-01-24 00:00:00')],
[46,2,Timestamp('2012-11-15 00:00:00'),'PREG',Timestamp('2013-02-26 00:00:00')],
[46,2,Timestamp('2012-11-16 00:00:00'),'BRED',Timestamp('2013-03-10 00:00:00')],
[46,2,Timestamp('2012-11-15 00:00:00'),'PREG',Timestamp('2013-04-16 00:00:00')],
[46,2,Timestamp('2001-11-15 00:00:00'),'PREG',Timestamp('2013-08-06 00:00:00')]]
Answer 1
Score: 1
This should work:
    import pandas as pd

    df = pd.DataFrame([[46, 1, pd.Timestamp('2011-09-23'), 'BRED', pd.Timestamp('2012-03-02')],
                       [46, 1, pd.Timestamp('2011-09-23'), 'PREG', pd.Timestamp('2012-04-03')],
                       [46, 1, pd.Timestamp('2011-09-23'), 'PREG', pd.Timestamp('2012-05-22')],
                       [46, 1, pd.Timestamp('2011-09-23'), 'PREG', pd.Timestamp('2012-10-09')],
                       [46, 2, pd.Timestamp('2012-11-15'), 'FRESH', pd.Timestamp('2012-11-15')],
                       [46, 2, pd.Timestamp('2012-11-15'), 'LUT', pd.Timestamp('2013-01-08')],
                       [46, 2, pd.Timestamp('2012-11-15'), 'OS', pd.Timestamp('2013-01-15')],
                       [46, 2, pd.Timestamp('2012-11-15'), 'BRED', pd.Timestamp('2013-01-01')],
                       [46, 2, pd.Timestamp('2012-11-15'), 'BRED', pd.Timestamp('2013-01-24')],
                       [46, 2, pd.Timestamp('2012-11-15'), 'PREG', pd.Timestamp('2013-02-26')],
                       [46, 2, pd.Timestamp('2012-11-16'), 'BRED', pd.Timestamp('2013-03-10')]],
                      columns=['ID', 'LACT', 'FDAT', 'EVENT', 'DATE'])

    # Sort so that rows within each group are in chronological order.
    df = df.sort_values(['ID', 'LACT', 'FDAT', 'DATE'])

    last_bred_dates = []
    # groupby yields groups in sorted key order, which matches the sorted frame.
    for name, group in df.groupby(['ID', 'LACT', 'FDAT']):
        last_bred_date = pd.NaT  # reset for every group
        for i, row in group.iterrows():
            if row['EVENT'] == 'BRED':
                last_bred_date = row['DATE']
                last_bred_dates.append(pd.NaT)
            elif row['EVENT'] == 'PREG':
                last_bred_dates.append(last_bred_date)
            else:
                last_bred_dates.append(pd.NaT)

    # Reuse df's index: after sorting it is no longer 0..n-1, so a
    # default-indexed Series would be aligned to the wrong rows.
    df['CDAT'] = pd.Series(last_bred_dates, index=df.index)
Output:
| | ID | LACT | FDAT | EVENT | DATE | CDAT |
|---|---|---|---|---|---|---|
| 0 | 46 | 1 | 2011-09-23 00:00:00 | BRED | 2012-03-02 00:00:00 | NaT |
| 1 | 46 | 1 | 2011-09-23 00:00:00 | PREG | 2012-04-03 00:00:00 | 2012-03-02 00:00:00 |
| 2 | 46 | 1 | 2011-09-23 00:00:00 | PREG | 2012-05-22 00:00:00 | 2012-03-02 00:00:00 |
| 3 | 46 | 1 | 2011-09-23 00:00:00 | PREG | 2012-10-09 00:00:00 | 2012-03-02 00:00:00 |
| 4 | 46 | 2 | 2012-11-15 00:00:00 | FRESH | 2012-11-15 00:00:00 | NaT |
| 7 | 46 | 2 | 2012-11-15 00:00:00 | BRED | 2013-01-01 00:00:00 | NaT |
| 5 | 46 | 2 | 2012-11-15 00:00:00 | LUT | 2013-01-08 00:00:00 | NaT |
| 6 | 46 | 2 | 2012-11-15 00:00:00 | OS | 2013-01-15 00:00:00 | NaT |
| 8 | 46 | 2 | 2012-11-15 00:00:00 | BRED | 2013-01-24 00:00:00 | NaT |
| 9 | 46 | 2 | 2012-11-15 00:00:00 | PREG | 2013-02-26 00:00:00 | 2013-01-24 00:00:00 |
| 10 | 46 | 2 | 2012-11-16 00:00:00 | BRED | 2013-03-10 00:00:00 | NaT |
Explanation:
Sort the rows by date, then group the df by ['ID', 'LACT', 'FDAT'] to get the desired groups. Create an empty list and iterate over the rows of each group. If the EVENT of a row is BRED, save its DATE value and append a missing value to the list. If the EVENT is PREG, append the saved value to the list. For any other event, append a missing value. Finally, use that list to create the new CDAT column.
Note that the variable last_bred_date is reset to a missing value before iterating over each group, so that only dates from that same group are ever appended to the list.
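On larger frames the row-by-row loop may become slow. One possible vectorized alternative is sketched below, not taken from the original answer: mask DATE so that only BRED rows keep a value, forward-fill that mask within each (ID, LACT, FDAT) group, and keep the result only on PREG rows. The names `keys`, `bred_dates` and `last_bred` are illustrative. The sketch assumes the frame is sorted by DATE within each group, and a BRED row that shares its date with a PREG row would count as preceding it only if it happens to sort first.

```python
import pandas as pd

# Sample rows from the question (the 11 rows shown in the table).
rows = [
    (46, 1, '2011-09-23', 'BRED', '2012-03-02'),
    (46, 1, '2011-09-23', 'PREG', '2012-04-03'),
    (46, 1, '2011-09-23', 'PREG', '2012-05-22'),
    (46, 1, '2011-09-23', 'PREG', '2012-10-09'),
    (46, 2, '2012-11-15', 'FRESH', '2012-11-15'),
    (46, 2, '2012-11-15', 'LUT', '2013-01-08'),
    (46, 2, '2012-11-15', 'OS', '2013-01-15'),
    (46, 2, '2012-11-15', 'BRED', '2013-01-01'),
    (46, 2, '2012-11-15', 'BRED', '2013-01-24'),
    (46, 2, '2012-11-15', 'PREG', '2013-02-26'),
    (46, 2, '2012-11-16', 'BRED', '2013-03-10'),
]
df = pd.DataFrame(rows, columns=['ID', 'LACT', 'FDAT', 'EVENT', 'DATE'])
df['FDAT'] = pd.to_datetime(df['FDAT'])
df['DATE'] = pd.to_datetime(df['DATE'])

# Chronological order within each group is required for the forward fill.
df = df.sort_values(['ID', 'LACT', 'FDAT', 'DATE'])

# Keep DATE only on BRED rows; every other row becomes NaT.
bred_dates = df['DATE'].where(df['EVENT'].eq('BRED'))

# Carry the most recent BRED date forward, never across group boundaries.
keys = [df['ID'], df['LACT'], df['FDAT']]
last_bred = bred_dates.groupby(keys).ffill()

# Only PREG rows receive a CDAT value; all other rows stay NaT.
df['CDAT'] = last_bred.where(df['EVENT'].eq('PREG'))

print(df)
```

Because every intermediate Series keeps the index of `df`, the final assignment aligns row by row without any index bookkeeping.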