March 1, 2023, 16:41:21

Aggregating dataframe rows using groupby, combining multiple columns

Question

I have the following pandas dataframe:
```python
import pandas as pd
from datetime import date, timedelta
df = pd.DataFrame(
    (
        (date(2023, 2, 27), timedelta(hours=0.5), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1.5), "project A", "execution"),
        (date(2023, 2, 27), timedelta(hours=0.25), "project B", "planning"),
        (date(2023, 2, 28), timedelta(hours=3), "project A", "wrapup"),
        (date(2023, 2, 28), timedelta(hours=3), "project B", "execution"),
        (date(2023, 2, 28), timedelta(hours=2), "project B", "miscellaneous"),
    ),
    columns=("date", "duration", "project", "description"),
)
print(df)
```
```
         date        duration    project    description
0  2023-02-27 0 days 00:30:00  project A       planning
1  2023-02-27 0 days 01:00:00  project A       planning
2  2023-02-27 0 days 01:30:00  project A      execution
3  2023-02-27 0 days 00:15:00  project B       planning
4  2023-02-28 0 days 03:00:00  project A         wrapup
5  2023-02-28 0 days 03:00:00  project B      execution
6  2023-02-28 0 days 02:00:00  project B  miscellaneous
```

I want to aggregate the duration and description columns, grouping by date and project. The result should look something like:

```python
result = pd.DataFrame(
    (
        (
            date(2023, 2, 27),
            "project A",
            timedelta(hours=3),
            "planning (1.5), execution (1.5)",
        ),
        (date(2023, 2, 27), "project B", timedelta(hours=0.25), "planning"),
        (date(2023, 2, 28), "project A", timedelta(hours=3), "wrapup"),
        (
            date(2023, 2, 28),
            "project B",
            timedelta(hours=5),
            "execution (3), miscellaneous (2)",
        ),
    ),
    columns=("date", "project", "duration", "description"),
)
print(result)
```
```
         date    project        duration                       description
0  2023-02-27  project A 0 days 03:00:00   planning (1.5), execution (1.5)
1  2023-02-27  project B 0 days 00:15:00                          planning
2  2023-02-28  project A 0 days 03:00:00                            wrapup
3  2023-02-28  project B 0 days 05:00:00  execution (3), miscellaneous (2)
```

Aggregating the duration column is easy using groupby():

```python
df.groupby(by=["date", "project"])["duration"].sum().to_frame().reset_index()
```

But I'm unsure how to handle the description column with groupby(). I considered using DataFrameGroupBy.apply() with custom functions on two levels, one for grouping by date and project, and one for grouping by description. Something like:

```python
def agg_description(df):
    ...

def agg_date_project(df):
    ...
    agg_description(...)
    ...

# Select several columns with a list (double brackets), not a bare tuple
df.groupby(by=["date", "project"])[["duration", "description"]].apply(agg_date_project)
```

But I can't figure it out. A complicating factor is that the aggregation for the description column depends on the duration column as well. I could do it "manually" (e.g. using loops), but if possible I'd like to do it using groupby() as well.

Answer 1

Score: 2

This one is a bit tricky; you will need a few intermediate steps.

First, let's compute the sum of the duration over each date, project and description:

```python
sum_df = df.groupby(by=["date", "project", "description"], as_index=False)["duration"].sum()
```

Then we need to get the duration in hours:

```python
sum_df["duration_hours"] = sum_df["duration"].apply(lambda x: x.total_seconds()/60/60)
```

Now we can format a string that contains the description and the time:

```python
sum_df["description_time"] = sum_df.apply(lambda x: f"{x['description']} ({x['duration_hours']})", axis=1)
```

Then we can aggregate by grouping by date and project only, to get the final description:

```python
sum_df["final_description"] = sum_df.groupby(["date", "project"])["description_time"].transform(lambda x: ', '.join(x))
```

Finally, you will need to group by date and project again, summing the duration and keeping final_description (I keep only the first value, since all values are the same within each group):

```python
sum_df.groupby(by=["date", "project"], as_index=False)[["duration", "final_description"]].agg({"duration": 'sum', "final_description": 'first'})
```

There you go!

(Note that the formatting is not exactly the one in your expected result: groups with only one type of description still show the time in parentheses. I believe it shouldn't be too hard to modify the result if you really need to.)
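The formatting caveat in the note above can be handled with a small variation. The following is only a sketch, not part of the original answer: it assumes the `date`, `project`, `duration` and `description` columns from the question, uses `sort=False` so that descriptions keep their order of first appearance, formats hours with `{:g}` to drop a trailing `.0`, and leaves out the hours when a date/project group has a single description. The names `n_parts` and `labelled` are made up for illustration.

```python
import pandas as pd
from datetime import date, timedelta

df = pd.DataFrame(
    (
        (date(2023, 2, 27), timedelta(hours=0.5), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1.5), "project A", "execution"),
        (date(2023, 2, 27), timedelta(hours=0.25), "project B", "planning"),
        (date(2023, 2, 28), timedelta(hours=3), "project A", "wrapup"),
        (date(2023, 2, 28), timedelta(hours=3), "project B", "execution"),
        (date(2023, 2, 28), timedelta(hours=2), "project B", "miscellaneous"),
    ),
    columns=("date", "duration", "project", "description"),
)

# Step 1: total duration per date/project/description, in order of first appearance
sum_df = df.groupby(
    ["date", "project", "description"], as_index=False, sort=False
)["duration"].sum()

# Step 2: hours as floats, rendered without a trailing ".0" (3.0 -> "3")
hours = sum_df["duration"].dt.total_seconds() / 3600
labelled = sum_df["description"] + " (" + hours.map("{:g}".format) + ")"

# Step 3: only attach the hours when the date/project group has several descriptions
n_parts = sum_df.groupby(["date", "project"])["description"].transform("count")
sum_df["description"] = labelled.where(n_parts > 1, sum_df["description"])

# Step 4: collapse to one row per date/project
result = sum_df.groupby(["date", "project"], as_index=False, sort=False).agg(
    {"duration": "sum", "description": ", ".join}
)
print(result)
```

Under these assumptions, the description column should come out in the form asked for in the question, for example `planning (1.5), execution (1.5)` for project A on 2023-02-27 and plain `planning` for project B on the same day.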

Answer 2

Score: 2

You can do that in one go, without any use of `.apply`:
```python
result = (
    df.groupby(["date", "project", "description"], as_index=False).sum()
    .assign(description=lambda df:
        df["description"] + " ("
        + (df["duration"].dt.total_seconds() / 3_600).astype("str") + ")"
    )
    .groupby(["date", "project"], as_index=False).agg({
        "duration": "sum", "description": ", ".join
    })
)
```

First, calculate the sums for each date-project-description group.
Then augment the description column with the respective durations.
Finally, aggregate over the date-project groups: summing the durations and `", ".join`-ing the descriptions.

Result:

```
         date    project        duration                           description
0  2023-02-27  project A 0 days 03:00:00       execution (1.5), planning (1.5)
1  2023-02-27  project B 0 days 00:15:00                       planning (0.25)
2  2023-02-28  project A 0 days 03:00:00                          wrapup (3.0)
3  2023-02-28  project B 0 days 05:00:00  execution (3.0), miscellaneous (2.0)
```

If you don't want that level of aggregation for the parts (everything in one column), then you could do:

```python
result = (
    df.pivot_table(
        values="duration", index=["date", "project"], columns="description",
        aggfunc="sum", fill_value=pd.Timedelta(0)
    )
    .assign(duration=lambda df: df.sum(axis=1))
    .reset_index()
)
```

Result:

```
description        date    project       execution   miscellaneous  \
0            2023-02-27  project A 0 days 01:30:00 0 days 00:00:00   
1            2023-02-27  project B 0 days 00:00:00 0 days 00:00:00   
2            2023-02-28  project A 0 days 00:00:00 0 days 00:00:00   
3            2023-02-28  project B 0 days 03:00:00 0 days 02:00:00   

description        planning          wrapup        duration  
0           0 days 01:30:00 0 days 00:00:00 0 days 03:00:00  
1           0 days 00:15:00 0 days 00:00:00 0 days 00:15:00  
2           0 days 00:00:00 0 days 03:00:00 0 days 03:00:00  
3           0 days 00:00:00 0 days 00:00:00 0 days 05:00:00
```
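The wide table above is somewhat hard to scan because every cell is a full timedelta. As a possible variation (a sketch, not the answer's own code), the same layout can be built in plain hours. Note that this swaps `pivot_table` for `groupby(...).unstack()` and assumes the column names from the question; the `hours` and `total` column names are invented for this example.

```python
import pandas as pd
from datetime import date, timedelta

df = pd.DataFrame(
    (
        (date(2023, 2, 27), timedelta(hours=0.5), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1.5), "project A", "execution"),
        (date(2023, 2, 27), timedelta(hours=0.25), "project B", "planning"),
        (date(2023, 2, 28), timedelta(hours=3), "project A", "wrapup"),
        (date(2023, 2, 28), timedelta(hours=3), "project B", "execution"),
        (date(2023, 2, 28), timedelta(hours=2), "project B", "miscellaneous"),
    ),
    columns=("date", "duration", "project", "description"),
)

wide = (
    # Convert each duration to float hours before aggregating
    df.assign(hours=df["duration"].dt.total_seconds() / 3600)
    # Sum the hours per date/project/description ...
    .groupby(["date", "project", "description"])["hours"].sum()
    # ... and spread the descriptions out into columns, 0 where there is no entry
    .unstack("description", fill_value=0.0)
)
# Row total over all description columns
wide["total"] = wide.sum(axis=1)
wide = wide.reset_index()
print(wide)
```

Each description becomes a float column of hours, and `total` holds the per-row sum, which should equal the summed duration of the corresponding date/project group.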

Answer 3

Score: 0

Why do you want to have every description's duration in a single column? Here is how I would do it:

```python
# %%
# One column per description, holding the duration where it matches (NaT elsewhere)
df['planning'] = df[df['description'] == 'planning']['duration']
df['execution'] = df[df['description'] == 'execution']['duration']
df['wrapup'] = df[df['description'] == 'wrapup']['duration']
df['miscellaneous'] = df[df['description'] == 'miscellaneous']['duration']
df = df.fillna(timedelta(hours=0))
df
# %%
proj_duration = df.groupby(by=["date", "project"])["duration"].sum().to_frame().reset_index()
proj_duration
# %%
description_duration = df.groupby(by=["date", "project"])[['planning','execution','wrapup','miscellaneous']].sum().reset_index()
description_duration
# %%
final = proj_duration.merge(description_duration, on=['date','project'])
final
# %%
```

[Image in the original answer: the resulting dataframe]

If you have only a few descriptions this works well; otherwise you can create a list of descriptions and work on that with loops.
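The loop mentioned in the last sentence could look roughly like the following. This is a sketch under the question's column names, not the answerer's code: it takes the list of descriptions from the data itself, builds one duration column per description with `Series.where`, and gets totals and per-description sums from a single `groupby`, so no merge is needed. The names `wide`, `names` and `per_desc` are hypothetical.

```python
import pandas as pd
from datetime import date, timedelta

df = pd.DataFrame(
    (
        (date(2023, 2, 27), timedelta(hours=0.5), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1.5), "project A", "execution"),
        (date(2023, 2, 27), timedelta(hours=0.25), "project B", "planning"),
        (date(2023, 2, 28), timedelta(hours=3), "project A", "wrapup"),
        (date(2023, 2, 28), timedelta(hours=3), "project B", "execution"),
        (date(2023, 2, 28), timedelta(hours=2), "project B", "miscellaneous"),
    ),
    columns=("date", "duration", "project", "description"),
)

# Every distinct description, in order of first appearance
names = list(df["description"].unique())

wide = df.copy()
for name in names:
    # Keep the duration where the description matches, zero elsewhere
    wide[name] = df["duration"].where(df["description"] == name, pd.Timedelta(0))

# One pass: total duration plus one summed column per description
per_desc = wide.groupby(["date", "project"], as_index=False)[["duration", *names]].sum()
print(per_desc)
```

This assumes that no description text collides with an existing column name such as `date` or `project`; if that can happen, the new columns would need a prefix.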
