Create extra rows using a date column in a pandas DataFrame

Question

Imagine I have the following data:

ID	Leave Type	Start Date	  End Date
1	Sick	    2022-01-01	  2022-01-01
1	Holiday	    2023-03-28	
2	Holiday	    2023-01-01	  2023-01-02
3	Work	    2023-01-01	  2023-01-01

I need to check whether Start Date and End Date have the same value. If they differ, I need to count how many days End Date is ahead of Start Date and create one row per day, adding 1 day each time and always setting Start Date and End Date to the same value. If End Date is blank, rows should be created until 2023-03-30 is reached. The result should be this data:

ID	Leave Type	Start Date	  End Date
1	Sick	    2022-01-01	  2022-01-01
1	Holiday	    2023-03-28	  2023-03-28
1	Holiday	    2023-03-29	  2023-03-29
1	Holiday	    2023-03-30	  2023-03-30
1	Holiday	    2023-03-31	  2023-03-31
2	Holiday	    2023-01-01	  2023-01-01
2	Holiday	    2023-01-02	  2023-01-02
3	Work	    2023-01-01	  2023-01-01

Thank you!

Answer 1

Score: 0

You can use:

# df is the DataFrame from the reproducible input below
import pandas as pd

# ensure datetime dtype and fill missing end dates with the default date
df[['Start Date', 'End Date']] = df[['Start Date', 'End Date']].apply(pd.to_datetime)
df['End Date'] = df['End Date'].fillna(pd.Timestamp('2023-03-30'))

# repeat each index label once per day in its range (inclusive) and build the output
idx = df.index.repeat(df['End Date'].sub(df['Start Date']).dt.days.add(1))
out = df.loc[idx]

# add 0, 1, 2, ... days within each original row
out['Start Date'] += pd.to_timedelta(out.groupby(level=0).cumcount(), unit='D')
out['End Date'] = out['Start Date']

Output:

   ID Leave Type Start Date   End Date
0   1       Sick 2022-01-01 2022-01-01
1   1    Holiday 2023-03-28 2023-03-28
1   1    Holiday 2023-03-29 2023-03-29
1   1    Holiday 2023-03-30 2023-03-30
2   2    Holiday 2023-01-01 2023-01-01
2   2    Holiday 2023-01-02 2023-01-02
3   3       Work 2023-01-01 2023-01-01

Reproducible input:

df = pd.DataFrame({'ID': [1, 1, 2, 3],
                   'Leave Type': ['Sick', 'Holiday', 'Holiday', 'Work'],
                   'Start Date': ['2022-01-01', '2023-03-28', '2023-01-01', '2023-01-01'],
                   'End Date': ['2022-01-01', None, '2023-01-02', '2023-01-01']})
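
For reference, the steps in this answer can be bundled into one self-contained function. This is only a minimal sketch, not the answerer's exact code: the function name `expand_daily` and its `cutoff` parameter are hypothetical, the index is reset at the end (the answer keeps the repeated index), and it assumes End Date is never earlier than Start Date.

```python
import pandas as pd


def expand_daily(df, cutoff="2023-03-30"):
    """Return one row per day between 'Start Date' and 'End Date' (inclusive).

    `cutoff` (a hypothetical parameter) replaces missing end dates.
    """
    out = df.copy()
    out["Start Date"] = pd.to_datetime(out["Start Date"])
    out["End Date"] = pd.to_datetime(out["End Date"]).fillna(pd.Timestamp(cutoff))

    # Number of rows each original row expands to
    n_days = (out["End Date"] - out["Start Date"]).dt.days + 1
    out = out.loc[out.index.repeat(n_days)]

    # Position of each repeated row within its original row: 0, 1, 2, ...
    offset = pd.to_timedelta(out.groupby(level=0).cumcount(), unit="D")
    out["Start Date"] = out["Start Date"] + offset
    out["End Date"] = out["Start Date"]
    return out.reset_index(drop=True)


df = pd.DataFrame({'ID': [1, 1, 2, 3],
                   'Leave Type': ['Sick', 'Holiday', 'Holiday', 'Work'],
                   'Start Date': ['2022-01-01', '2023-03-28', '2023-01-01', '2023-01-01'],
                   'End Date': ['2022-01-01', None, '2023-01-02', '2023-01-01']})

result = expand_daily(df)
print(result)
```

Passing a different `cutoff` (for example `"2023-03-31"`) changes only how blank end dates are filled; rows that already have an end date are unaffected.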

Answer 2

Score: 0

Assuming the extra row (the 5th row, 2023-03-31) in your expected output was pasted by mistake, you can also try this:

import pandas as pd
from datetime import timedelta, datetime

# create the dataframe
df = pd.DataFrame({'ID': [1, 1, 2, 3], 
                   'Leave Type': ['Sick', 'Holiday', 'Holiday', 'Work'], 
                   'Start Date': ['2022-01-01', '2023-03-28', '2023-01-01', '2023-01-01'], 
                   'End Date': ['2022-01-01', '', '2023-01-02', '2023-01-01']})

# convert date columns to datetime format
df['Start Date'] = pd.to_datetime(df['Start Date'])
df['End Date'] = pd.to_datetime(df['End Date'])

# fill in blank end dates with the cutoff date (2023-03-30)
df['End Date'] = df['End Date'].fillna(datetime(2023, 3, 30))

# create an empty list to store new rows
create_rows = []

# loop through each row in the dataframe
for index, row in df.iterrows():
    
    # if the start and end dates are not the same, add rows for each day in between
    if row['Start Date'] != row['End Date']:
        delta = row['End Date'] - row['Start Date']
        for i in range(delta.days + 1):
            date = row['Start Date'] + timedelta(days=i)
            create_rows.append({'ID': row['ID'], 'Leave Type': row['Leave Type'], 'Start Date': date, 'End Date': date})
    
    # if the start and end dates are the same, append the original row
    else:
        create_rows.append({'ID': row['ID'], 'Leave Type': row['Leave Type'], 'Start Date': row['Start Date'], 'End Date': row['End Date']})
    
# create a new dataframe with the original rows and the new rows
output_df = pd.DataFrame(create_rows)

# sort the dataframe by ID and Start Date
output_df = output_df.sort_values(['ID', 'Start Date'])

# reset the index
output_df = output_df.reset_index(drop=True)

print(output_df)
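
Since `range(delta.days + 1)` already runs exactly once when the two dates are equal, both branches of the `if`/`else` above produce the same rows, so the loop can be shortened. The following is a sketch of that idea using `pd.date_range`, under the same assumption that blank end dates become 2023-03-30; the name `records` is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 3],
                   'Leave Type': ['Sick', 'Holiday', 'Holiday', 'Work'],
                   'Start Date': ['2022-01-01', '2023-03-28', '2023-01-01', '2023-01-01'],
                   'End Date': ['2022-01-01', '', '2023-01-02', '2023-01-01']})

df['Start Date'] = pd.to_datetime(df['Start Date'])
# errors='coerce' turns unparseable values, including '', into NaT
df['End Date'] = (pd.to_datetime(df['End Date'], errors='coerce')
                    .fillna(pd.Timestamp('2023-03-30')))

# One record per day in each row's inclusive date range
records = [
    {'ID': row['ID'], 'Leave Type': row['Leave Type'],
     'Start Date': day, 'End Date': day}
    for _, row in df.iterrows()
    for day in pd.date_range(row['Start Date'], row['End Date'], freq='D')
]

output_df = pd.DataFrame(records)
print(output_df)
```

Rows come out in the original row order, so the explicit sort and index reset are only needed if the input is not already ordered by ID and Start Date.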

Answer 3

Score: 0

import pandas as pd
from pandas.tseries.offsets import MonthEnd

df = pd.DataFrame({'Leave Type': ['Sick', 'Holiday', 'Holiday', 'Work'],
                   'Start Date': ['2022-01-01', '2023-03-28', '2023-01-01', '2023-01-01'],
                   'End Date': ['2022-01-01', '', '2023-01-02', '2023-01-01'],
                   })
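# Convert columns 'Start Date' and 'End Date' to datetime ('' becomes NaT)
df[['Start Date', 'End Date']] = \
    df[['Start Date', 'End Date']].apply(pd.to_datetime, errors='coerce')
# Fill NaT values with the last day of the start date's month
df['End Date'] = df['End Date'].fillna(df['Start Date'] + MonthEnd(0))
# Replace 'End Date' values with lists of daily dates from start to end
df['End Date'] = \
    [pd.date_range(s, e, freq='D').tolist() for s, e in zip(df['Start Date'], df['End Date'])]
# Explode the lists into one row per day
df = df.explode('End Date')
# explode leaves the column with object dtype, so convert it back to datetime
df['End Date'] = pd.to_datetime(df['End Date'])
# Each row now covers a single day, so the start date equals the end date
df['Start Date'] = df['End Date']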

print(df)

Result

  Leave Type Start Date   End Date
0       Sick 2022-01-01 2022-01-01
1    Holiday 2023-03-28 2023-03-28
1    Holiday 2023-03-29 2023-03-29
1    Holiday 2023-03-30 2023-03-30
1    Holiday 2023-03-31 2023-03-31
2    Holiday 2023-01-01 2023-01-01
2    Holiday 2023-01-02 2023-01-02
3       Work 2023-01-01 2023-01-01
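
Note that this answer fills the blank end date with the end of the month (2023-03-31), which matches the expected output in the question, whereas the question text names 2023-03-30. The sketch below contrasts the two fill strategies on made-up values; the variable names are illustrative and not from the answer.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

start = pd.Series(pd.to_datetime(['2023-03-28', '2023-03-31']))
end = pd.Series([pd.NaT, pd.Timestamp('2023-04-02')])

# MonthEnd(0) rolls a date forward to the end of its month
# and leaves a date that is already a month end unchanged.
print((start + MonthEnd(0)).dt.strftime('%Y-%m-%d').tolist())
# ['2023-03-31', '2023-03-31']

# Strategy used in this answer: fill blanks with the start date's month end
month_end_fill = end.fillna(start + MonthEnd(0))

# Strategy described in the question text: fill blanks with a fixed date
fixed_fill = end.fillna(pd.Timestamp('2023-03-30'))

print(month_end_fill.dt.strftime('%Y-%m-%d').tolist())  # ['2023-03-31', '2023-04-02']
print(fixed_fill.dt.strftime('%Y-%m-%d').tolist())      # ['2023-03-30', '2023-04-02']
```

Only rows with a missing end date differ between the two; which one is right depends on whether the 2023-03-31 row in the question was intended.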

huangapple
  • Published on 2023-03-31 22:43:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75899840.html