英文:
Extracting date values from a list column and placing them into new columns
问题
我有一个包含列ABDAT的数据框,其中包含一个列表。ABDAT中的元素数量是可变的。ABDAT中的大多数元素是NaT值。ABDAT中最多有三个有效日期。如果ABDAT中有一个日期,我想将其放入一个名为ABDAT1的新列中。如果ABDAT中有两个有效日期,我想将第一个日期放入ABDAT1,第二个日期放入ABDAT2。如果ABDAT中有三个日期,我想将第一个日期放入ABDAT1,第二个日期放入ABDAT2,第三个日期放入ABDAT3。以下是数据框中ABDAT列的示例。
我想创建以下结果:
数据框非常大(800,000行)
谢谢
英文:
I have a dataframe that includes a column ABDAT that contains a list. The number of elements in ABDAT is variable. Most of the elements in ABDAT are NaT values. There is upto three valid dates in ABDAT. If ABDAT has one date I would like to put it in a new column called ABDAT1. It ABDAT two valid dates I would like to put the first date in ABDAT1 and the second date into ABDAT2. If ABDAT has three dates I would like to put the first in ABDAT1, the second in ABDAT2 and the third in ABDAT3. The following is examples of the ABDAT column in the dataframe.
ID FDAT ABDAT
0 1 2004-08-17 [2004-08-17 00:00:00, NaT, NaT, NaT]
1 1 2005-07-10 [NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT]
2 1 2006-06-12 [NaT, NaT]
3 3 2001-12-20 [NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT]
4 3 2003-07-14 [2001-02-17 00:00:00, NaT, NaT, NaT, NaT, NaT]
5 3 2004-06-01 [NaT, 2012-12-21 00:00:00, NaT, NaT, NaT, NaT, NaT, 2018-05-01 00:00:00, NaT]
I would like to create
ID FDAT ABDAT1 ABDAT2 ABDAT3
0 1 2004-08-17 2004-08-17 NaT NaT
1 1 2005-07-10 NaT NaT NaT
2 1 2006-06-12 NaT NaT NaT
3 3 2001-12-20 NaT NaT NaT
4 3 2003-07-14 2001-02-17 NaT NaT
5 3 2004-06-01 2012-12-21 2018-05-01 NaT
The dataframe is quite large (800000 rows)
Thanks
答案1
得分: 1
这是您提供的代码的翻译结果:
import pandas as pd
import numpy as np
data = {'ID': [1, 1, 1, 3, 3, 3],
'FDAT': ['2004-08-17', '2005-07-10', '2006-06-12', '2001-12-20', '2003-07-14', '2004-06-01'],
'ABDAT': [['2004-08-17 00:00:00', 'NaT', 'NaT', 'NaT'],
['NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'],
['NaT', 'NaT'],
['NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'],
['2001-02-17 00:00:00', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'],
['NaT', '2012-12-21 00:00:00', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', '2018-05-01 00:00:00', 'NaT']]
}
df = pd.DataFrame(data)
def extract_dates(row):
dates = [d for d in row['ABDAT'] if not pd.isna(d)]
num_dates = len(dates)
for i in range(num_dates):
col_name = f'ABDAT{i+1}'
row[col_name] = dates[i]
return row
df = df.apply(extract_dates, axis=1)
df.replace([np.datetime64('NaT')], [np.nan], inplace=True)
print(df)
这段代码将产生以下输出:
ABDAT ABDAT1 \
0 ['2004-08-17 00:00:00', 'NaT', 'NaT', 'NaT'] 2004-08-17 00:00:00
1 ['NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'] NaT
2 ['NaT', 'NaT'] NaT
3 ['NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'] NaT
4 ['2001-02-17 00:00:00', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'] 2001-02-17 00:00:00
5 ['NaT', '2012-12-21 00:00:00', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', '2018-05-01 00:00:00', 'NaT'] NaT
ABDAT10 ABDAT11 ABDAT12 ABDAT13 ABDAT2 ABDAT3 ABDAT4 ABDAT5 \
0 NaN NaN NaN NaN NaT NaT NaT NaN
1 NaN NaN NaN NaN NaT NaT NaT NaT
2 NaN NaN NaN NaN NaT NaN NaN NaN
3 NaT NaT NaT NaT NaT NaT NaT NaT
4 NaN NaN NaN NaN NaT NaT NaT NaT
5 NaN NaN NaN NaN 2012-12-21 00:00:00 NaT NaT NaT
ABDAT6 ABDAT7 ABDAT8 ABDAT9 FDAT ID
0 NaN NaN NaN NaN 2004-08-17 1
1 NaT NaT NaT NaT 2005-07-10 1
2 NaN NaN NaN NaN 2006-06-12 1
3 NaT NaT NaT NaT 2001-12-20 3
4 NaT NaN NaN NaN 2003-07-14 3
5 NaT NaT 2018-05-01 00:00:00 NaT 2004-06-01 3
这段代码与您提供的原始代码功能相同,但通用性更强,可以处理更多的日期列。
英文:
It is failry simple:
# Import the pandas library
import pandas as pd
# Create a sample dataframe
data = {'ID': [1, 1, 1, 3, 3, 3],
'FDAT': ['2004-08-17', '2005-07-10', '2006-06-12', '2001-12-20', '2003-07-14', '2004-06-01'],
'ABDAT': [['2004-08-17 00:00:00', pd.NaT, pd.NaT, pd.NaT],
[pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT],
[pd.NaT, pd.NaT],
[pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT],
['2001-02-17 00:00:00', pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT],
[pd.NaT, '2012-12-21 00:00:00', pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, '2018-05-01 00:00:00', pd.NaT]]}
df = pd.DataFrame(data)
df['ABDAT1'] = pd.NaT
df['ABDAT2'] = pd.NaT
df['ABDAT3'] = pd.NaT
for i in range(len(df)):
dates = [date for date in df.loc[i, 'ABDAT'] if not pd.isnull(date)]
if len(dates) >= 1:
df.at[i, 'ABDAT1'] = dates[0]
if len(dates) >= 2:
df.at[i, 'ABDAT2'] = dates[1]
if len(dates) >= 3:
df.at[i, 'ABDAT3'] = dates[2]
df = df.drop('ABDAT', axis = 1)
print(df)
which gives you
ID FDAT ABDAT1 ABDAT2 ABDAT3
0 1 2004-08-17 2004-08-17 NaT NaT
1 1 2005-07-10 NaT NaT NaT
2 1 2006-06-12 NaT NaT NaT
3 3 2001-12-20 NaT NaT NaT
4 3 2003-07-14 2001-02-17 NaT NaT
5 3 2004-06-01 2012-12-21 2018-05-01 NaT
But, here I assume you have no more that 3 possible dates. You can generralize it this way:
import pandas as pd
import numpy as np
data = {'ID': [1, 1, 1, 3, 3, 3],
'FDAT': ['2004-08-17', '2005-07-10', '2006-06-12', '2001-12-20', '2003-07-14', '2004-06-01'],
'ABDAT': [['2004-08-17 00:00:00', 'NaT', 'NaT', 'NaT'],
['NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'],
['NaT', 'NaT'],
['NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'],
['2001-02-17 00:00:00', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'],
['NaT', '2012-12-21 00:00:00', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', '2018-05-01 00:00:00', 'NaT']]
}
df = pd.DataFrame(data)
def extract_dates(row):
dates = [d for d in row['ABDAT'] if not pd.isna(d)]
num_dates = len(dates)
for i in range(num_dates):
col_name = f'ABDAT{i+1}'
row[col_name] = dates[i]
return row
df = df.apply(extract_dates, axis=1)
df.replace([np.datetime64('NaT')], [np.nan], inplace=True)
print(df)
which gives
ABDAT ABDAT1 \
0 [2004-08-17 00:00:00, NaT, NaT, NaT] 2004-08-17 00:00:00
1 [NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT] NaT
2 [NaT, NaT] NaT
3 [NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, ... NaT
4 [2001-02-17 00:00:00, NaT, NaT, NaT, NaT, NaT] 2001-02-17 00:00:00
5 [NaT, 2012-12-21 00:00:00, NaT, NaT, NaT, NaT,... NaT
ABDAT10 ABDAT11 ABDAT12 ABDAT13 ABDAT2 ABDAT3 ABDAT4 ABDAT5 \
0 NaN NaN NaN NaN NaT NaT NaT NaN
1 NaN NaN NaN NaN NaT NaT NaT NaT
2 NaN NaN NaN NaN NaT NaN NaN NaN
3 NaT NaT NaT NaT NaT NaT NaT NaT
4 NaN NaN NaN NaN NaT NaT NaT NaT
5 NaN NaN NaN NaN 2012-12-21 00:00:00 NaT NaT NaT
ABDAT6 ABDAT7 ABDAT8 ABDAT9 FDAT ID
0 NaN NaN NaN NaN 2004-08-17 1
1 NaT NaT NaT NaT 2005-07-10 1
2 NaN NaN NaN NaN 2006-06-12 1
3 NaT NaT NaT NaT 2001-12-20 3
4 NaT NaN NaN NaN 2003-07-14 3
5 NaT NaT 2018-05-01 00:00:00 NaT 2004-06-01 3
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论