Pandas Join Rows If Time Is Continuous

Question

I have a pandas DataFrame that shows when employees want to take time off. The Event Title is always in the format "User Off", followed by the specific time range in parentheses if it is not an all-day event. Here is a snippet of the DataFrame df:

            Event Title      Start        End Employee
UserA Off (07:00-12:00) 2023-05-08 2023-05-09    UserA
UserA Off (12:00-15:30) 2023-05-08 2023-05-09    UserA
             UserB Off  2023-05-10 2023-05-11    UserB
UserC Off (08:00-10:30) 2023-05-30 2023-05-31    UserC
UserC Off (10:30-16:30) 2023-05-30 2023-05-31    UserC
UserD Off (09:30-10:00) 2023-05-10 2023-05-11    UserD
UserE Off (13:00-16:00) 2023-06-02 2023-06-03    UserE
UserE Off (07:30-13:00) 2023-06-02 2023-06-03    UserE

Users A, C, and E each have two rows for the same start and end date, but the times are continuous (this is because the time off is drawn from multiple vacation buckets, such as Vacation and Floating Holiday). What is the best way to combine these rows if the times are continuous?

Here is the first query I created, which displays only the users that have multiple entries on the same day:

test = df.groupby(by=['Start','End','Employee']).filter(lambda x: len(x) > 1)

            Event Title      Start        End Employee
UserA Off (07:00-12:00) 2023-05-08 2023-05-09    UserA
UserA Off (12:00-15:30) 2023-05-08 2023-05-09    UserA
UserC Off (08:00-10:30) 2023-05-30 2023-05-31    UserC
UserC Off (10:30-16:30) 2023-05-30 2023-05-31    UserC
UserE Off (13:00-16:00) 2023-06-02 2023-06-03    UserE
UserE Off (07:30-13:00) 2023-06-02 2023-06-03    UserE

Now I need to check whether the times are continuous (i.e., if an employee's first time-off entry on a day ends at noon, the next entry for that employee should start at noon). If the times are continuous, it can be assumed to be an all-day absence.
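
To make the continuity condition concrete, here is a minimal sketch of the check for a single pair of titles. The group names `start` and `end` and the variable names are chosen for illustration only and do not come from the original data; the sketch assumes that times are always zero-padded as HH:MM.

```python
import pandas as pd

titles = pd.Series([
    "UserA Off (07:00-12:00)",
    "UserA Off (12:00-15:30)",
    "UserB Off",
])

# Named groups become column names; titles without a time range yield NaN.
times = titles.str.extract(r"\((?P<start>\d\d:\d\d)-(?P<end>\d\d:\d\d)\)")

# Two entries are continuous when the first ends exactly where the second starts.
is_continuous = times.loc[0, "end"] == times.loc[1, "start"]

print(times)
print(is_continuous)
```

Because the times are zero-padded strings, comparing and sorting them as text gives the same result as comparing the actual clock times, so no conversion to datetime is needed for this check.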

My original idea was to iterate over each group of df.groupby(by=['Start','End','Employee']), sort by Event Title, and then use iterrows() to check whether the end time of each row, Event Title.str.split('-')[1], equals the start time of the next row, Event Title.str.split('(')[1].str.split('-')[0], but this seems very inefficient.

My end result would be the original dataframe df but have the continuous times be joined like so:

             Event Title      Start        End Employee
              UserA Off 2023-05-08 2023-05-09    UserA
              UserB Off 2023-05-10 2023-05-11    UserB
              UserC Off 2023-05-30 2023-05-31    UserC
UserD Off (09:30-10:00) 2023-05-10 2023-05-11    UserD
              UserE Off 2023-06-02 2023-06-03    UserE

Answer 1

Score: 2

Given the data you provided, here's how you can join absences which are continuous, using groupby in combination with shift as well as some regex to extract the necessary information from the Event Title column:

import pandas as pd

data = {
    'Event Title': [
        'UserA Off (07:00-12:00)', 'UserA Off (12:00-15:30)', 'UserB Off',
        'UserC Off (08:00-10:30)', 'UserC Off (10:30-16:30)', 'UserD Off (09:30-10:00)',
        'UserE Off (13:00-16:00)', 'UserE Off (07:30-13:00)'
    ],
    'Start': [
        '2023-05-08', '2023-05-08', '2023-05-10', '2023-05-30',
        '2023-05-30', '2023-05-10', '2023-06-02', '2023-06-02'
    ],
    'End': [
        '2023-05-09', '2023-05-09', '2023-05-11', '2023-05-31',
        '2023-05-31', '2023-05-11', '2023-06-03', '2023-06-03'
    ],
    'Employee': [
        'UserA', 'UserA', 'UserB', 'UserC',
        'UserC', 'UserD', 'UserE', 'UserE'
    ]
}

df = pd.DataFrame(data)

# Extract start and end time as "HH:MM" strings (NaN for all-day events)
df['_time_off_start_hour'] = df['Event Title'].str.extract(r'(?<=\()(\d\d:\d\d)', expand=False)
df['_time_off_end_hour'] = df['Event Title'].str.extract(r'(\d\d:\d\d)(?=\))', expand=False)

# Sort so that each employee's entries for a day are in chronological order
df.sort_values(by=['Start', 'End', 'Employee', '_time_off_start_hour'], inplace=True)

# Compute helper columns: neighbouring start/end times within each group
df['_time_off_start_next_event'] = (
    df
    .groupby(['Start', 'End', 'Employee'])['_time_off_start_hour']
    .shift(-1)
)
df['_time_off_end_previous_event'] = (
    df
    .groupby(['Start', 'End', 'Employee'])['_time_off_end_hour']
    .shift(1)
)
# True if this row ends exactly where the next row of the group starts
df['_continuous_time'] = (
    df.loc[:, '_time_off_start_next_event']
    == df.loc[:, '_time_off_end_hour']
)
# True if this row starts exactly where the previous row of the group ended
df['_duplicated_row'] = (
    df.loc[:, '_time_off_end_previous_event']
    == df.loc[:, '_time_off_start_hour']
)

# Modify first row where continuous pair was found
df.loc[df['_continuous_time'], 'Event Title'] = df['Employee'] + " Off"
# Drop second row of each continuous pair (copy so the in-place drop below
# does not operate on a slice and raise SettingWithCopyWarning)
df = df.loc[~df['_duplicated_row'], :].copy()

# Drop internal columns
cols_to_drop = [col for col in df.columns if col.startswith("_")]
df.drop(columns=cols_to_drop, inplace=True)

# Show results (to_markdown requires the optional tabulate package)
print(df.sort_values(by='Employee').to_markdown(index=False))

Returns:

| Event Title             | Start      | End        | Employee   |
|:------------------------|:-----------|:-----------|:-----------|
| UserA Off               | 2023-05-08 | 2023-05-09 | UserA      |
| UserB Off               | 2023-05-10 | 2023-05-11 | UserB      |
| UserC Off               | 2023-05-30 | 2023-05-31 | UserC      |
| UserD Off (09:30-10:00) | 2023-05-10 | 2023-05-11 | UserD      |
| UserE Off               | 2023-06-02 | 2023-06-03 | UserE      |
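
The approach above merges any two neighbouring rows that touch. As a variation (not part of the original answer), the following sketch collapses a group only when all of its rows form one unbroken chain, so a day with a gap is left untouched. The helper name `merge_continuous`, the temporary columns `t0` and `t1`, and the UserF rows are hypothetical and exist only for illustration.

```python
import pandas as pd

KEYS = ["Start", "End", "Employee"]

def merge_continuous(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse groups whose time ranges form one unbroken chain."""
    # Pull "HH:MM-HH:MM" out of the title; rows without a range get NaN.
    times = df["Event Title"].str.extract(r"\((?P<t0>\d\d:\d\d)-(?P<t1>\d\d:\d\d)\)")
    out = df.join(times).sort_values(KEYS + ["t0"])

    grouped = out.groupby(KEYS, sort=False)
    position = grouped.cumcount()                      # 0 for the first row of a group
    is_last = grouped.cumcount(ascending=False) == 0   # True for the last row of a group
    links = out["t1"] == grouped["t0"].shift(-1)       # ends where the next row starts

    # A group is a chain if it has several rows and every row except the
    # last one links to its successor.
    ok = (links | is_last).groupby([out[k] for k in KEYS]).transform("all")
    chain = ok & (grouped["Event Title"].transform("size") > 1)

    # Keep the first row of each chain as an all-day entry, drop the rest.
    out.loc[chain, "Event Title"] = out.loc[chain, "Employee"] + " Off"
    out = out[~(chain & (position > 0))]
    return out.drop(columns=["t0", "t1"]).reset_index(drop=True)

df = pd.DataFrame({
    "Event Title": [
        "UserA Off (07:00-12:00)", "UserA Off (12:00-15:30)", "UserB Off",
        "UserD Off (09:30-10:00)", "UserE Off (13:00-16:00)", "UserE Off (07:30-13:00)",
        # Hypothetical rows with a gap between 10:00 and 11:00
        "UserF Off (08:00-10:00)", "UserF Off (11:00-12:00)",
    ],
    "Start": ["2023-05-08", "2023-05-08", "2023-05-10", "2023-05-10",
              "2023-06-02", "2023-06-02", "2023-06-05", "2023-06-05"],
    "End": ["2023-05-09", "2023-05-09", "2023-05-11", "2023-05-11",
            "2023-06-03", "2023-06-03", "2023-06-06", "2023-06-06"],
    "Employee": ["UserA", "UserA", "UserB", "UserD",
                 "UserE", "UserE", "UserF", "UserF"],
})

result = merge_continuous(df)
print(result)
```

In this sketch UserA and UserE are collapsed to a single all-day row, while both UserF rows are kept because their times do not touch. Whether a partially continuous day should be merged is a design decision that depends on how the calendar data is used.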

huangapple
  • Posted on May 25, 2023, 06:32:05
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76327794.html