从列表列中提取日期数值并将它们放入新列中

huangapple go评论96阅读模式
英文:

Extracting date values from a list column and placing them into new columns

问题

我有一个包含列ABDAT的数据框,其中包含一个列表。ABDAT中的元素数量是可变的。ABDAT中的大多数元素是NaT值。ABDAT中最多有三个有效日期。如果ABDAT中有一个日期,我想将其放入一个名为ABDAT1的新列中。如果ABDAT中有两个有效日期,我想将第一个日期放入ABDAT1,第二个日期放入ABDAT2。如果ABDAT中有三个日期,我想将第一个日期放入ABDAT1,第二个日期放入ABDAT2,第三个日期放入ABDAT3。以下是数据框中ABDAT列的示例。

我想创建以下结果:

数据框非常大(800,000行)

谢谢

英文:

I have a dataframe that includes a column ABDAT that contains a list. The number of elements in ABDAT is variable. Most of the elements in ABDAT are NaT values. There is upto three valid dates in ABDAT. If ABDAT has one date I would like to put it in a new column called ABDAT1. It ABDAT two valid dates I would like to put the first date in ABDAT1 and the second date into ABDAT2. If ABDAT has three dates I would like to put the first in ABDAT1, the second in ABDAT2 and the third in ABDAT3. The following is examples of the ABDAT column in the dataframe.

  1. ID FDAT ABDAT
  2. 0 1 2004-08-17 [2004-08-17 00:00:00, NaT, NaT, NaT]
  3. 1 1 2005-07-10 [NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT]
  4. 2 1 2006-06-12 [NaT, NaT]
  5. 3 3 2001-12-20 [NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT]
  6. 4 3 2003-07-14 [2001-02-17 00:00:00, NaT, NaT, NaT, NaT, NaT]
  7. 5 3 2004-06-01 [NaT, 2012-12-21 00:00:00, NaT, NaT, NaT, NaT, NaT, 2018-05-01 00:00:00, NaT]

I would like to create

  1. ID FDAT ABDAT1 ABDAT2 ABDAT3
  2. 0 1 2004-08-17 2004-08-17 NaT NaT
  3. 1 1 2005-07-10 NaT NaT NaT
  4. 2 1 2006-06-12 NaT NaT NaT
  5. 3 3 2001-12-20 NaT NaT NaT
  6. 4 3 2003-07-14 2001-02-17 NaT NaT
  7. 5 3 2004-06-01 2012-12-21 2018-05-01 NaT

The dataframe is quite large (800000 rows)

Thanks

答案1

得分: 1

这是您提供的代码的翻译结果:

  1. import pandas as pd
  2. import numpy as np
  3. data = {'ID': [1, 1, 1, 3, 3, 3],
  4. 'FDAT': ['2004-08-17', '2005-07-10', '2006-06-12', '2001-12-20', '2003-07-14', '2004-06-01'],
  5. 'ABDAT': [['2004-08-17 00:00:00', 'NaT', 'NaT', 'NaT'],
  6. ['NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'],
  7. ['NaT', 'NaT'],
  8. ['NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'],
  9. ['2001-02-17 00:00:00', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'],
  10. ['NaT', '2012-12-21 00:00:00', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', '2018-05-01 00:00:00', 'NaT']]
  11. }
  12. df = pd.DataFrame(data)
  13. def extract_dates(row):
  14. dates = [d for d in row['ABDAT'] if not pd.isna(d)]
  15. num_dates = len(dates)
  16. for i in range(num_dates):
  17. col_name = f'ABDAT{i+1}'
  18. row[col_name] = dates[i]
  19. return row
  20. df = df.apply(extract_dates, axis=1)
  21. df.replace([np.datetime64('NaT')], [np.nan], inplace=True)
  22. print(df)

这段代码将产生以下输出:

  1. ABDAT ABDAT1 \
  2. 0 ['2004-08-17 00:00:00', 'NaT', 'NaT', 'NaT'] 2004-08-17 00:00:00
  3. 1 ['NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'] NaT
  4. 2 ['NaT', 'NaT'] NaT
  5. 3 ['NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'] NaT
  6. 4 ['2001-02-17 00:00:00', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'] 2001-02-17 00:00:00
  7. 5 ['NaT', '2012-12-21 00:00:00', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', '2018-05-01 00:00:00', 'NaT'] NaT
  8. ABDAT10 ABDAT11 ABDAT12 ABDAT13 ABDAT2 ABDAT3 ABDAT4 ABDAT5 \
  9. 0 NaN NaN NaN NaN NaT NaT NaT NaN
  10. 1 NaN NaN NaN NaN NaT NaT NaT NaT
  11. 2 NaN NaN NaN NaN NaT NaN NaN NaN
  12. 3 NaT NaT NaT NaT NaT NaT NaT NaT
  13. 4 NaN NaN NaN NaN NaT NaT NaT NaT
  14. 5 NaN NaN NaN NaN 2012-12-21 00:00:00 NaT NaT NaT
  15. ABDAT6 ABDAT7 ABDAT8 ABDAT9 FDAT ID
  16. 0 NaN NaN NaN NaN 2004-08-17 1
  17. 1 NaT NaT NaT NaT 2005-07-10 1
  18. 2 NaN NaN NaN NaN 2006-06-12 1
  19. 3 NaT NaT NaT NaT 2001-12-20 3
  20. 4 NaT NaN NaN NaN 2003-07-14 3
  21. 5 NaT NaT 2018-05-01 00:00:00 NaT 2004-06-01 3

这段代码与您提供的原始代码功能相同,但通用性更强,可以处理更多的日期列。

英文:

It is failry simple:

  1. # Import the pandas library
  2. import pandas as pd
  3. # Create a sample dataframe
  4. data = {'ID': [1, 1, 1, 3, 3, 3],
  5. 'FDAT': ['2004-08-17', '2005-07-10', '2006-06-12', '2001-12-20', '2003-07-14', '2004-06-01'],
  6. 'ABDAT': [['2004-08-17 00:00:00', pd.NaT, pd.NaT, pd.NaT],
  7. [pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT],
  8. [pd.NaT, pd.NaT],
  9. [pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT],
  10. ['2001-02-17 00:00:00', pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT],
  11. [pd.NaT, '2012-12-21 00:00:00', pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, '2018-05-01 00:00:00', pd.NaT]]}
  12. df = pd.DataFrame(data)
  13. df['ABDAT1'] = pd.NaT
  14. df['ABDAT2'] = pd.NaT
  15. df['ABDAT3'] = pd.NaT
  16. for i in range(len(df)):
  17. dates = [date for date in df.loc[i, 'ABDAT'] if not pd.isnull(date)]
  18. if len(dates) >= 1:
  19. df.at[i, 'ABDAT1'] = dates[0]
  20. if len(dates) >= 2:
  21. df.at[i, 'ABDAT2'] = dates[1]
  22. if len(dates) >= 3:
  23. df.at[i, 'ABDAT3'] = dates[2]
  24. df = df.drop('ABDAT', axis = 1)
  25. print(df)

which gives you

  1. ID FDAT ABDAT1 ABDAT2 ABDAT3
  2. 0 1 2004-08-17 2004-08-17 NaT NaT
  3. 1 1 2005-07-10 NaT NaT NaT
  4. 2 1 2006-06-12 NaT NaT NaT
  5. 3 3 2001-12-20 NaT NaT NaT
  6. 4 3 2003-07-14 2001-02-17 NaT NaT
  7. 5 3 2004-06-01 2012-12-21 2018-05-01 NaT

But, here I assume you have no more that 3 possible dates. You can generralize it this way:

  1. import pandas as pd
  2. import numpy as np
  3. data = {'ID': [1, 1, 1, 3, 3, 3],
  4. 'FDAT': ['2004-08-17', '2005-07-10', '2006-06-12', '2001-12-20', '2003-07-14', '2004-06-01'],
  5. 'ABDAT': [['2004-08-17 00:00:00', 'NaT', 'NaT', 'NaT'],
  6. ['NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'],
  7. ['NaT', 'NaT'],
  8. ['NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'],
  9. ['2001-02-17 00:00:00', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'],
  10. ['NaT', '2012-12-21 00:00:00', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', '2018-05-01 00:00:00', 'NaT']]
  11. }
  12. df = pd.DataFrame(data)
  13. def extract_dates(row):
  14. dates = [d for d in row['ABDAT'] if not pd.isna(d)]
  15. num_dates = len(dates)
  16. for i in range(num_dates):
  17. col_name = f'ABDAT{i+1}'
  18. row[col_name] = dates[i]
  19. return row
  20. df = df.apply(extract_dates, axis=1)
  21. df.replace([np.datetime64('NaT')], [np.nan], inplace=True)
  22. print(df)

which gives

  1. ABDAT ABDAT1 \
  2. 0 [2004-08-17 00:00:00, NaT, NaT, NaT] 2004-08-17 00:00:00
  3. 1 [NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT] NaT
  4. 2 [NaT, NaT] NaT
  5. 3 [NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, ... NaT
  6. 4 [2001-02-17 00:00:00, NaT, NaT, NaT, NaT, NaT] 2001-02-17 00:00:00
  7. 5 [NaT, 2012-12-21 00:00:00, NaT, NaT, NaT, NaT,... NaT
  8. ABDAT10 ABDAT11 ABDAT12 ABDAT13 ABDAT2 ABDAT3 ABDAT4 ABDAT5 \
  9. 0 NaN NaN NaN NaN NaT NaT NaT NaN
  10. 1 NaN NaN NaN NaN NaT NaT NaT NaT
  11. 2 NaN NaN NaN NaN NaT NaN NaN NaN
  12. 3 NaT NaT NaT NaT NaT NaT NaT NaT
  13. 4 NaN NaN NaN NaN NaT NaT NaT NaT
  14. 5 NaN NaN NaN NaN 2012-12-21 00:00:00 NaT NaT NaT
  15. ABDAT6 ABDAT7 ABDAT8 ABDAT9 FDAT ID
  16. 0 NaN NaN NaN NaN 2004-08-17 1
  17. 1 NaT NaT NaT NaT 2005-07-10 1
  18. 2 NaN NaN NaN NaN 2006-06-12 1
  19. 3 NaT NaT NaT NaT 2001-12-20 3
  20. 4 NaT NaN NaN NaN 2003-07-14 3
  21. 5 NaT NaT 2018-05-01 00:00:00 NaT 2004-06-01 3

huangapple
  • 本文由 发表于 2023年3月4日 08:29:20
  • 转载请务必保留本文链接:https://go.coder-hub.com/75632910.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定