2023-05-25 06:32:05

Pandas Join Rows If Time Is Continuous

Question

I have a pandas dataframe that shows when employees want to take time off. The Event Title always has the form "User Off", followed by the specific time range if it is not an all-day event. Here is a snippet of the dataframe df:

            Event Title      Start        End Employee
UserA Off (07:00-12:00) 2023-05-08 2023-05-09    UserA
UserA Off (12:00-15:30) 2023-05-08 2023-05-09    UserA
             UserB Off  2023-05-10 2023-05-11    UserB
UserC Off (08:00-10:30) 2023-05-30 2023-05-31    UserC
UserC Off (10:30-16:30) 2023-05-30 2023-05-31    UserC
UserD Off (09:30-10:00) 2023-05-10 2023-05-11    UserD
UserE Off (13:00-16:00) 2023-06-02 2023-06-03    UserE
UserE Off (07:30-13:00) 2023-06-02 2023-06-03    UserE

Users A, C, and E have 2 lines for the same start and end date but the times are continuous (this is because they pull from multiple vacation buckets like Vacation and Floating Holiday). What is the best way to combine these rows if the time is continuous?

Here is the first query I created to display only users that have multiple entries on the same day:

test = df.groupby(by=['Start','End','Employee']).filter(lambda x: len(x) > 1)

            Event Title      Start        End Employee
UserA Off (07:00-12:00) 2023-05-08 2023-05-09    UserA
UserA Off (12:00-15:30) 2023-05-08 2023-05-09    UserA
UserC Off (08:00-10:30) 2023-05-30 2023-05-31    UserC
UserC Off (10:30-16:30) 2023-05-30 2023-05-31    UserC
UserE Off (13:00-16:00) 2023-06-02 2023-06-03    UserE
UserE Off (07:30-13:00) 2023-06-02 2023-06-03    UserE
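
The same selection can probably be written without a lambda. The sketch below assumes a cut-down copy of df (three rows only, built here for illustration) and uses DataFrame.duplicated with keep=False; the names keys and multi are placeholders, not part of the original question.

```python
import pandas as pd

# Illustrative subset of the question's data (assumed, not the full frame).
df = pd.DataFrame({
    "Event Title": ["UserA Off (07:00-12:00)", "UserA Off (12:00-15:30)", "UserB Off"],
    "Start": ["2023-05-08", "2023-05-08", "2023-05-10"],
    "End": ["2023-05-09", "2023-05-09", "2023-05-11"],
    "Employee": ["UserA", "UserA", "UserB"],
})

keys = ["Start", "End", "Employee"]
# keep=False flags every row whose key combination occurs more than once,
# not only the second and later occurrences.
multi = df[df.duplicated(subset=keys, keep=False)]
print(multi["Employee"].tolist())  # ['UserA', 'UserA']
```

This avoids calling a Python function once per group, which may matter on larger frames.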

Now I need to see if the times are continuous (i.e. if an individual's first time-off entry on a day ends at noon, then their next entry should start at noon). If the entries are continuous, it can be assumed to be an all-day absence.
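
To pin down what "continuous" means here, a minimal pure-Python sketch follows. It assumes the times are zero-padded "HH:MM" strings, as in the Event Title column; the function name is_continuous and the sample lists are made up for illustration.

```python
def is_continuous(intervals):
    """Return True if the sorted (start, end) pairs chain with no gap or overlap."""
    # Zero-padded "HH:MM" strings sort and compare correctly as plain text.
    ordered = sorted(intervals)
    return all(
        prev_end == next_start
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    )

user_e = [("13:00", "16:00"), ("07:30", "13:00")]   # out of order, but back to back
with_gap = [("07:00", "12:00"), ("13:00", "15:30")]  # one-hour gap at noon

print(is_continuous(user_e))    # True
print(is_continuous(with_gap))  # False
```

Note that a single interval counts as continuous under this definition, so a caller would still need to check that there is more than one entry.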

My original idea was to iterate over each dataframe df.groupby(by=['Start','End','Employee']), sort by Event Title and then do iterrows() on each row and parse if Event Title.str.split('-')[1] equals Event Title.str.split('(')[1].str.split('-')[0] of the next row, but this seems very inefficient.

My end result would be the original dataframe df but have the continuous times be joined like so:

             Event Title      Start        End Employee
              UserA Off 2023-05-08 2023-05-09    UserA
              UserB Off 2023-05-10 2023-05-11    UserB
              UserC Off 2023-05-30 2023-05-31    UserC
UserD Off (09:30-10:00) 2023-05-10 2023-05-11    UserD
              UserE Off 2023-06-02 2023-06-03    UserE

Answer 1

Score: 2

Given the data you provided, here's how you can join absences which are continuous, using groupby in combination with shift as well as some regex to extract the necessary information from the Event Title column:

import pandas as pd

data = {
    'Event Title': [
        'UserA Off (07:00-12:00)', 'UserA Off (12:00-15:30)', 'UserB Off',
        'UserC Off (08:00-10:30)', 'UserC Off (10:30-16:30)', 'UserD Off (09:30-10:00)',
        'UserE Off (13:00-16:00)', 'UserE Off (07:30-13:00)'
        ],
    'Start': [
        '2023-05-08', '2023-05-08', '2023-05-10', '2023-05-30',
        '2023-05-30', '2023-05-10', '2023-06-02', '2023-06-02'
        ],
    'End': [
        '2023-05-09', '2023-05-09', '2023-05-11', '2023-05-31',
        '2023-05-31', '2023-05-11', '2023-06-03', '2023-06-03'
        ],
    'Employee': [
        'UserA', 'UserA', 'UserB', 'UserC',
        'UserC', 'UserD', 'UserE', 'UserE'
        ]
}
df = pd.DataFrame(data)

# Extract start and end time (NaN for all-day events without a time range)
df['_time_off_start_hour'] = df['Event Title'].str.extract(r'(?<=\()(\d\d:\d\d)', expand=False)
df['_time_off_end_hour'] = df['Event Title'].str.extract(r'(\d\d:\d\d)(?=\))', expand=False)

# Sort so that each employee's entries for a day are in chronological order
df.sort_values(by=['Start','End','Employee', '_time_off_start_hour'], inplace=True)

# Compute helper columns: the next entry's start and the previous entry's end
df['_time_off_start_next_event'] = (
    df
    .groupby(['Start','End','Employee'])['_time_off_start_hour']
    .shift(-1)
)
df['_time_off_end_previous_event'] = (
    df
    .groupby(['Start','End','Employee'])['_time_off_end_hour']
    .shift(1)
)
df['_continuous_time'] = (
    df.loc[:, '_time_off_start_next_event']
    == df.loc[:, '_time_off_end_hour']
)
df['_duplicated_row'] = (
    df.loc[:, '_time_off_end_previous_event']
    == df.loc[:, '_time_off_start_hour']
)

# Modify first row where continuous pair was found
df.loc[df['_continuous_time'], 'Event Title'] = df['Employee'] + " Off"

# Drop second row of each continuous pair
df = df.loc[~df['_duplicated_row'], :]

# Drop internal columns (non-inplace, to avoid modifying a slice in place)
cols_to_drop = [col for col in df.columns if col.startswith("_")]
df = df.drop(columns=cols_to_drop)

# Show results (to_markdown requires the optional tabulate package)
print(df.sort_values(by='Employee').to_markdown(index=False))

Returns:

| Event Title             | Start      | End        | Employee   |
|:------------------------|:-----------|:-----------|:-----------|
| UserA Off               | 2023-05-08 | 2023-05-09 | UserA      |
| UserB Off               | 2023-05-10 | 2023-05-11 | UserB      |
| UserC Off               | 2023-05-30 | 2023-05-31 | UserC      |
| UserD Off (09:30-10:00) | 2023-05-10 | 2023-05-11 | UserD      |
| UserE Off               | 2023-06-02 | 2023-06-03 | UserE      |
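
The approach above is written around pairs of entries. As a possible variation, the sketch below numbers each run of back-to-back entries with a cumulative sum, so a chain of three or more entries would also collapse into one row. The function name merge_continuous and the helper column names are assumptions made for this example, and it keeps the question's rule that a continuous run is treated as an all-day absence.

```python
import pandas as pd

def merge_continuous(df):
    """Collapse each chain of back-to-back timed entries into a single row."""
    keys = ["Start", "End", "Employee"]
    out = df.copy()
    # Pull "HH:MM" start and end out of titles such as "UserA Off (07:00-12:00)".
    times = out["Event Title"].str.extract(r"\((\d\d:\d\d)-(\d\d:\d\d)\)")
    out["_start"], out["_end"] = times[0], times[1]
    out = out.sort_values(keys + ["_start"])
    prev_end = out.groupby(keys)["_end"].shift(1)
    # A new run begins whenever an entry does not start where the previous
    # one ended (missing values never compare equal, so all-day rows and the
    # first row of each group always begin a new run).
    out["_run"] = (out["_start"] != prev_end).cumsum()
    run_size = out["_run"].map(out["_run"].value_counts())
    # Per the question, a continuous run is assumed to be an all-day absence.
    out.loc[run_size > 1, "Event Title"] = out["Employee"] + " Off"
    out = out.drop_duplicates(subset="_run", keep="first")
    return out.drop(columns=["_start", "_end", "_run"]).reset_index(drop=True)

df = pd.DataFrame({
    "Event Title": [
        "UserA Off (07:00-12:00)", "UserA Off (12:00-15:30)", "UserB Off",
        "UserC Off (08:00-10:30)", "UserC Off (10:30-16:30)", "UserD Off (09:30-10:00)",
        "UserE Off (13:00-16:00)", "UserE Off (07:30-13:00)",
    ],
    "Start": ["2023-05-08", "2023-05-08", "2023-05-10", "2023-05-30",
              "2023-05-30", "2023-05-10", "2023-06-02", "2023-06-02"],
    "End": ["2023-05-09", "2023-05-09", "2023-05-11", "2023-05-31",
            "2023-05-31", "2023-05-11", "2023-06-03", "2023-06-03"],
    "Employee": ["UserA", "UserA", "UserB", "UserC",
                 "UserC", "UserD", "UserE", "UserE"],
})

result = merge_continuous(df)
print(result.sort_values("Employee")["Event Title"].tolist())
# ['UserA Off', 'UserB Off', 'UserC Off', 'UserD Off (09:30-10:00)', 'UserE Off']
```

If a day contains one continuous run plus a separate, non-adjacent entry, this sketch keeps the separate entry as its own row; whether that is the desired behaviour depends on the data.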
