从列表列中提取日期数值并将它们放入新列中

huangapple go评论72阅读模式
英文:

Extracting date values from a list column and placing them into new columns

问题

我有一个包含列ABDAT的数据框,其中包含一个列表。ABDAT中的元素数量是可变的。ABDAT中的大多数元素是NaT值。ABDAT中最多有三个有效日期。如果ABDAT中有一个日期,我想将其放入一个名为ABDAT1的新列中。如果ABDAT中有两个有效日期,我想将第一个日期放入ABDAT1,第二个日期放入ABDAT2。如果ABDAT中有三个日期,我想将第一个日期放入ABDAT1,第二个日期放入ABDAT2,第三个日期放入ABDAT3。以下是数据框中ABDAT列的示例。

我想创建以下结果:

数据框非常大(800,000行)

谢谢

英文:

I have a dataframe that includes a column ABDAT that contains a list. The number of elements in ABDAT is variable. Most of the elements in ABDAT are NaT values. There is upto three valid dates in ABDAT. If ABDAT has one date I would like to put it in a new column called ABDAT1. It ABDAT two valid dates I would like to put the first date in ABDAT1 and the second date into ABDAT2. If ABDAT has three dates I would like to put the first in ABDAT1, the second in ABDAT2 and the third in ABDAT3. The following is examples of the ABDAT column in the dataframe.

	ID	  FDAT	      ABDAT
0	1	2004-08-17	[2004-08-17 00:00:00, NaT, NaT, NaT]
1	1	2005-07-10	[NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT]
2	1	2006-06-12	[NaT, NaT]
3	3	2001-12-20	[NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT]
4	3	2003-07-14	[2001-02-17 00:00:00, NaT, NaT, NaT, NaT, NaT]
5	3	2004-06-01	[NaT, 2012-12-21 00:00:00, NaT, NaT, NaT, NaT, NaT, 2018-05-01 00:00:00, NaT]

I would like to create

	ID	  FDAT	      ABDAT1       ABDAT2        ABDAT3
0	1	2004-08-17	2004-08-17      NaT            NaT
1	1	2005-07-10	NaT             NaT            NaT
2	1	2006-06-12	NaT             NaT            NaT
3	3	2001-12-20	NaT             NaT            NaT
4	3	2003-07-14	2001-02-17      NaT            NaT
5	3	2004-06-01	2012-12-21    2018-05-01       NaT

The dataframe is quite large (800000 rows)

Thanks

答案1

得分: 1

这是您提供的代码的翻译结果:

import pandas as pd
import numpy as np

data = {'ID': [1, 1, 1, 3, 3, 3],
        'FDAT': ['2004-08-17', '2005-07-10', '2006-06-12', '2001-12-20', '2003-07-14', '2004-06-01'],
        'ABDAT': [['2004-08-17 00:00:00', 'NaT', 'NaT', 'NaT'],
                  ['NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'],
                  ['NaT', 'NaT'],
                  ['NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'],
                  ['2001-02-17 00:00:00', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'],
                  ['NaT', '2012-12-21 00:00:00', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', '2018-05-01 00:00:00', 'NaT']]
        }

df = pd.DataFrame(data)

def extract_dates(row):
    dates = [d for d in row['ABDAT'] if not pd.isna(d)]
    num_dates = len(dates)
    for i in range(num_dates):
        col_name = f'ABDAT{i+1}'
        row[col_name] = dates[i]
    return row

df = df.apply(extract_dates, axis=1)

df.replace([np.datetime64('NaT')], [np.nan], inplace=True)

print(df)

这段代码将产生以下输出:

                                              ABDAT               ABDAT1  \
0               ['2004-08-17 00:00:00', 'NaT', 'NaT', 'NaT']  2004-08-17 00:00:00   
1      ['NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT']                  NaT   
2                                         ['NaT', 'NaT']                  NaT   
3  ['NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT']                  NaT   
4     ['2001-02-17 00:00:00', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT']  2001-02-17 00:00:00   
5  ['NaT', '2012-12-21 00:00:00', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', '2018-05-01 00:00:00', 'NaT']                  NaT   

  ABDAT10 ABDAT11 ABDAT12 ABDAT13               ABDAT2 ABDAT3 ABDAT4 ABDAT5  \
0     NaN     NaN     NaN     NaN                  NaT    NaT    NaT    NaN   
1     NaN     NaN     NaN     NaN                  NaT    NaT    NaT    NaT   
2     NaN     NaN     NaN     NaN                  NaT    NaN    NaN    NaN   
3     NaT     NaT     NaT     NaT                  NaT    NaT    NaT    NaT   
4     NaN     NaN     NaN     NaN                  NaT    NaT    NaT    NaT   
5     NaN     NaN     NaN     NaN  2012-12-21 00:00:00    NaT    NaT    NaT   

  ABDAT6 ABDAT7               ABDAT8 ABDAT9        FDAT  ID  
0    NaN    NaN                  NaN    NaN  2004-08-17   1  
1    NaT    NaT                  NaT    NaT  2005-07-10   1  
2    NaN    NaN                  NaN    NaN  2006-06-12   1  
3    NaT    NaT                  NaT    NaT  2001-12-20   3  
4    NaT    NaN                  NaN    NaN  2003-07-14   3  
5    NaT    NaT  2018-05-01 00:00:00    NaT  2004-06-01   3  

这段代码与您提供的原始代码功能相同,但通用性更强,可以处理更多的日期列。

英文:

It is failry simple:

# Import the pandas library
import pandas as pd

# Create a sample dataframe
data = {'ID': [1, 1, 1, 3, 3, 3],
        'FDAT': ['2004-08-17', '2005-07-10', '2006-06-12', '2001-12-20', '2003-07-14', '2004-06-01'],
        'ABDAT': [['2004-08-17 00:00:00', pd.NaT, pd.NaT, pd.NaT],
                  [pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT],
                  [pd.NaT, pd.NaT],
                  [pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT],
                  ['2001-02-17 00:00:00', pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT],
                  [pd.NaT, '2012-12-21 00:00:00', pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, '2018-05-01 00:00:00', pd.NaT]]}

df = pd.DataFrame(data)

df['ABDAT1'] = pd.NaT
df['ABDAT2'] = pd.NaT
df['ABDAT3'] = pd.NaT

for i in range(len(df)):
    dates = [date for date in df.loc[i, 'ABDAT'] if not pd.isnull(date)]
    if len(dates) >= 1:
        df.at[i, 'ABDAT1'] = dates[0]
    if len(dates) >= 2:
        df.at[i, 'ABDAT2'] = dates[1]
    if len(dates) >= 3:
        df.at[i, 'ABDAT3'] = dates[2]

df = df.drop('ABDAT', axis = 1)
print(df)

which gives you

   ID        FDAT     ABDAT1     ABDAT2 ABDAT3
0   1  2004-08-17 2004-08-17        NaT    NaT
1   1  2005-07-10        NaT        NaT    NaT
2   1  2006-06-12        NaT        NaT    NaT
3   3  2001-12-20        NaT        NaT    NaT
4   3  2003-07-14 2001-02-17        NaT    NaT
5   3  2004-06-01 2012-12-21 2018-05-01    NaT

But, here I assume you have no more that 3 possible dates. You can generralize it this way:

import pandas as pd
import numpy as np

data = {'ID': [1, 1, 1, 3, 3, 3],
        'FDAT': ['2004-08-17', '2005-07-10', '2006-06-12', '2001-12-20', '2003-07-14', '2004-06-01'],
        'ABDAT': [['2004-08-17 00:00:00', 'NaT', 'NaT', 'NaT'],
                  ['NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'],
                  ['NaT', 'NaT'],
                  ['NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'],
                  ['2001-02-17 00:00:00', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'],
                  ['NaT', '2012-12-21 00:00:00', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', '2018-05-01 00:00:00', 'NaT']]
        }

df = pd.DataFrame(data)

def extract_dates(row):
    dates = [d for d in row['ABDAT'] if not pd.isna(d)]
    num_dates = len(dates)
    for i in range(num_dates):
        col_name = f'ABDAT{i+1}'
        row[col_name] = dates[i]
    return row

df = df.apply(extract_dates, axis=1)

df.replace([np.datetime64('NaT')], [np.nan], inplace=True)

print(df)

which gives

                                              ABDAT               ABDAT1  \
0               [2004-08-17 00:00:00, NaT, NaT, NaT]  2004-08-17 00:00:00   
1      [NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT]                  NaT   
2                                         [NaT, NaT]                  NaT   
3  [NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, ...                  NaT   
4     [2001-02-17 00:00:00, NaT, NaT, NaT, NaT, NaT]  2001-02-17 00:00:00   
5  [NaT, 2012-12-21 00:00:00, NaT, NaT, NaT, NaT,...                  NaT   

  ABDAT10 ABDAT11 ABDAT12 ABDAT13               ABDAT2 ABDAT3 ABDAT4 ABDAT5  \
0     NaN     NaN     NaN     NaN                  NaT    NaT    NaT    NaN   
1     NaN     NaN     NaN     NaN                  NaT    NaT    NaT    NaT   
2     NaN     NaN     NaN     NaN                  NaT    NaN    NaN    NaN   
3     NaT     NaT     NaT     NaT                  NaT    NaT    NaT    NaT   
4     NaN     NaN     NaN     NaN                  NaT    NaT    NaT    NaT   
5     NaN     NaN     NaN     NaN  2012-12-21 00:00:00    NaT    NaT    NaT   

  ABDAT6 ABDAT7               ABDAT8 ABDAT9        FDAT  ID  
0    NaN    NaN                  NaN    NaN  2004-08-17   1  
1    NaT    NaT                  NaT    NaT  2005-07-10   1  
2    NaN    NaN                  NaN    NaN  2006-06-12   1  
3    NaT    NaT                  NaT    NaT  2001-12-20   3  
4    NaT    NaN                  NaN    NaN  2003-07-14   3  
5    NaT    NaT  2018-05-01 00:00:00    NaT  2004-06-01   3  

huangapple
  • 本文由 发表于 2023年3月4日 08:29:20
  • 转载请务必保留本文链接:https://go.coder-hub.com/75632910.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定