Create extra rows using date column pandas dataframe

Question

Imagine I have the following data:

    ID  Leave Type  Start Date  End Date
    1   Sick        2022-01-01  2022-01-01
    1   Holiday     2023-03-28
    2   Holiday     2023-01-01  2023-01-02
    3   Work        2023-01-01  2023-01-01

I need to check whether Start Date and End Date hold the same value. If they differ, I need to count how many days End Date is ahead of Start Date and create one row per day, adding one day each time so that Start Date and End Date always match within a row. If End Date is blank, rows should be created until the date reaches 2023-03-30. The result should look like this:

    ID  Leave Type  Start Date  End Date
    1   Sick        2022-01-01  2022-01-01
    1   Holiday     2023-03-28  2023-03-28
    1   Holiday     2023-03-29  2023-03-29
    1   Holiday     2023-03-30  2023-03-30
    1   Holiday     2023-03-31  2023-03-31
    2   Holiday     2023-01-01  2023-01-01
    2   Holiday     2023-01-02  2023-01-02
    3   Work        2023-01-01  2023-01-01

Thank you!

Answer 1

Score: 0

You can use:

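    import pandas as pd

    # ensure both columns are datetime, then fill missing end dates with the cutoff date
    df[['Start Date', 'End Date']] = df[['Start Date', 'End Date']].apply(pd.to_datetime)
    df['End Date'] = df['End Date'].fillna(pd.Timestamp('2023-03-30'))

    # repeat each index label once per day in the range (inclusive), then build the output
    idx = df.index.repeat(df['End Date'].sub(df['Start Date']).dt.days.add(1))
    out = df.loc[idx].copy()  # copy so the assignments below do not touch a view of df

    # shift each repeated row by 0, 1, 2, ... days within its original row
    offsets = out.groupby(level=0).cumcount().to_numpy()
    out['Start Date'] += pd.to_timedelta(offsets, unit='D')
    out['End Date'] = out['Start Date']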

Output:

       ID Leave Type Start Date   End Date
    0   1       Sick 2022-01-01 2022-01-01
    1   1    Holiday 2023-03-28 2023-03-28
    1   1    Holiday 2023-03-29 2023-03-29
    1   1    Holiday 2023-03-30 2023-03-30
    2   2    Holiday 2023-01-01 2023-01-01
    2   2    Holiday 2023-01-02 2023-01-02
    3   3       Work 2023-01-01 2023-01-01

Reproducible input:

    df = pd.DataFrame({'ID': [1, 1, 2, 3],
                       'Leave Type': ['Sick', 'Holiday', 'Holiday', 'Work'],
                       'Start Date': ['2022-01-01', '2023-03-28', '2023-01-01', '2023-01-01'],
                       'End Date': ['2022-01-01', None, '2023-01-02', '2023-01-01']})
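
For reuse, the repeat-and-offset idea from this answer could be wrapped in a function. The following is only a sketch: the name `expand_leave_rows` and the `default_end` parameter are hypothetical and not part of the original answer, and it assumes every row has a Start Date that is not later than its End Date.

```python
import pandas as pd


def expand_leave_rows(df, default_end="2023-03-30"):
    """Return one row per calendar day between 'Start Date' and 'End Date'.

    Hypothetical helper: the function name and default_end are illustrative.
    """
    out = df.copy()
    out["Start Date"] = pd.to_datetime(out["Start Date"])
    # Missing end dates fall back to the cutoff date.
    out["End Date"] = pd.to_datetime(out["End Date"]).fillna(pd.Timestamp(default_end))

    # Each row is repeated (end - start).days + 1 times (inclusive range).
    n_days = (out["End Date"] - out["Start Date"]).dt.days + 1
    out = out.loc[out.index.repeat(n_days)].copy()

    # Offset each repeated row by 0, 1, 2, ... days within its original row.
    offsets = out.groupby(level=0).cumcount().to_numpy()
    out["Start Date"] = out["Start Date"] + pd.to_timedelta(offsets, unit="D")
    out["End Date"] = out["Start Date"]
    return out.reset_index(drop=True)


df = pd.DataFrame({"ID": [1, 1, 2, 3],
                   "Leave Type": ["Sick", "Holiday", "Holiday", "Work"],
                   "Start Date": ["2022-01-01", "2023-03-28", "2023-01-01", "2023-01-01"],
                   "End Date": ["2022-01-01", None, "2023-01-02", "2023-01-01"]})
result = expand_leave_rows(df)
print(result)
```

Working on a copy leaves the caller's frame untouched, and resetting the index at the end drops the repeated labels, which is a design choice rather than something the answer requires.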

Answer 2

Score: 0

Assuming the fifth row of your expected output (2023-03-31) was pasted by mistake, you can also try this:

    import pandas as pd
    from datetime import timedelta, datetime

    # create the dataframe
    df = pd.DataFrame({'ID': [1, 1, 2, 3],
                       'Leave Type': ['Sick', 'Holiday', 'Holiday', 'Work'],
                       'Start Date': ['2022-01-01', '2023-03-28', '2023-01-01', '2023-01-01'],
                       'End Date': ['2022-01-01', '', '2023-01-02', '2023-01-01']})

    # convert date columns to datetime format
    df['Start Date'] = pd.to_datetime(df['Start Date'])
    df['End Date'] = pd.to_datetime(df['End Date'])

    # fill in blank end dates with a maximum date value
    df['End Date'] = df['End Date'].fillna(datetime(2023, 3, 30))

    # create an empty list to store new rows
    create_rows = []

    # loop through each row in the dataframe
    for index, row in df.iterrows():
        # if the start and end dates are not the same, add rows for each day in between
        if row['Start Date'] != row['End Date']:
            delta = row['End Date'] - row['Start Date']
            for i in range(delta.days + 1):
                date = row['Start Date'] + timedelta(days=i)
                create_rows.append({'ID': row['ID'], 'Leave Type': row['Leave Type'],
                                    'Start Date': date, 'End Date': date})
        # if the start and end dates are the same, append the original row
        else:
            create_rows.append({'ID': row['ID'], 'Leave Type': row['Leave Type'],
                                'Start Date': row['Start Date'], 'End Date': row['End Date']})

    # create a new dataframe with the original rows and the new rows
    output_df = pd.DataFrame(create_rows)

    # sort the dataframe by ID and Start Date
    output_df = output_df.sort_values(['ID', 'Start Date'])

    # reset the index
    output_df = output_df.reset_index(drop=True)
    print(output_df)
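
This answer writes the blank End Date as an empty string and relies on `pd.to_datetime` turning it into `NaT`, so that `fillna` can then supply the cutoff. A small check of that behaviour, using made-up sample values (the variable names here are illustrative only):

```python
import pandas as pd

# Sample values: one valid date, one empty string, one None.
raw = pd.Series(["2022-01-01", "", None])

# Both the empty string and None are parsed as NaT (missing datetime).
parsed = pd.to_datetime(raw)
print(parsed.isna().tolist())

# fillna then replaces every NaT with the cutoff date.
filled = parsed.fillna(pd.Timestamp("2023-03-30"))
print(filled.tolist())
```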

Answer 3

Score: 0

    import pandas as pd
    from pandas.tseries.offsets import MonthEnd

    df = pd.DataFrame({'Leave Type': ['Sick', 'Holiday', 'Holiday', 'Work'],
                       'Start Date': ['2022-01-01', '2023-03-28', '2023-01-01', '2023-01-01'],
                       'End Date': ['2022-01-01', '', '2023-01-02', '2023-01-01'],
                       })

    # Convert columns 'Start Date' and 'End Date' to datetime (blank values become NaT)
    df[['Start Date', 'End Date']] = \
        df[['Start Date', 'End Date']].apply(pd.to_datetime, errors='coerce')

    # Fill NaT values with the last day of the start date's month
    # (this differs from the fixed 2023-03-30 cutoff in the question)
    df['End Date'] = df['End Date'].fillna(df['Start Date'] + MonthEnd(0))

    # Replace 'End Date' values with the list of dates from start to end
    df['End Date'] = \
        [pd.date_range(s, e, freq='D').tolist() for s, e in zip(df['Start Date'], df['End Date'])]

    # Explode the list into one row per date
    df = df.explode('End Date')
    df['Start Date'] = df['End Date']
    print(df)

Result:

      Leave Type Start Date   End Date
    0       Sick 2022-01-01 2022-01-01
    1    Holiday 2023-03-28 2023-03-28
    1    Holiday 2023-03-29 2023-03-29
    1    Holiday 2023-03-30 2023-03-30
    1    Holiday 2023-03-31 2023-03-31
    2    Holiday 2023-01-01 2023-01-01
    2    Holiday 2023-01-02 2023-01-02
    3       Work 2023-01-01 2023-01-01
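
The extra 2023-03-31 row in this result comes from filling the blank End Date with the end of the month rather than with the fixed date 2023-03-30 named in the question. A minimal sketch of the difference, with illustrative variable names:

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

start = pd.Timestamp("2023-03-28")

# MonthEnd(0) rolls forward to the last day of the month
# (a date already on the month end stays where it is).
month_end = start + MonthEnd(0)

# The question instead asks for a fixed cutoff date.
cutoff = pd.Timestamp("2023-03-30")

days_to_month_end = pd.date_range(start, month_end, freq="D")
days_to_cutoff = pd.date_range(start, cutoff, freq="D")

print(month_end.date(), len(days_to_month_end))
print(cutoff.date(), len(days_to_cutoff))
```

Swapping the `MonthEnd(0)` fill for `pd.Timestamp('2023-03-30')` would presumably bring this answer in line with the other two, at the cost of no longer reproducing the table as posted in the question.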

huangapple
  • Published on March 31, 2023, 22:43:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75899840.html