Aggregating dataframe rows using groupby, combining multiple columns

Question

I have the following pandas dataframe:

```python
import pandas as pd
from datetime import date, timedelta

df = pd.DataFrame(
    (
        (date(2023, 2, 27), timedelta(hours=0.5), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1.5), "project A", "execution"),
        (date(2023, 2, 27), timedelta(hours=0.25), "project B", "planning"),
        (date(2023, 2, 28), timedelta(hours=3), "project A", "wrapup"),
        (date(2023, 2, 28), timedelta(hours=3), "project B", "execution"),
        (date(2023, 2, 28), timedelta(hours=2), "project B", "miscellaneous"),
    ),
    columns=("date", "duration", "project", "description"),
)
print(df)
```

```
         date        duration    project    description
0  2023-02-27 0 days 00:30:00  project A       planning
1  2023-02-27 0 days 01:00:00  project A       planning
2  2023-02-27 0 days 01:30:00  project A      execution
3  2023-02-27 0 days 00:15:00  project B       planning
4  2023-02-28 0 days 03:00:00  project A         wrapup
5  2023-02-28 0 days 03:00:00  project B      execution
6  2023-02-28 0 days 02:00:00  project B  miscellaneous
```

I want to aggregate the `duration` and `description` columns, grouping by `date` and `project`. The result should look something like this:

```python
result = pd.DataFrame(
    (
        (
            date(2023, 2, 27),
            "project A",
            timedelta(hours=3),
            "planning (1.5), execution (1.5)",
        ),
        (date(2023, 2, 27), "project B", timedelta(hours=0.25), "planning"),
        (date(2023, 2, 28), "project A", timedelta(hours=3), "wrapup"),
        (
            date(2023, 2, 28),
            "project B",
            timedelta(hours=5),
            "execution (3), miscellaneous (2)",
        ),
    ),
    columns=("date", "project", "duration", "description"),
)
print(result)
```

```
         date    project        duration                       description
0  2023-02-27  project A 0 days 03:00:00   planning (1.5), execution (1.5)
1  2023-02-27  project B 0 days 00:15:00                          planning
2  2023-02-28  project A 0 days 03:00:00                            wrapup
3  2023-02-28  project B 0 days 05:00:00  execution (3), miscellaneous (2)
```

Aggregating the `duration` column is easy using `groupby()`:

```python
df.groupby(by=["date", "project"])["duration"].sum().to_frame().reset_index()
```

But I'm unsure how to handle the `description` column with `groupby()`. I considered using `DataFrameGroupBy.apply()` with custom functions on two levels, one for grouping by `date` and `project`, and one for grouping by `description`. Something like:

```python
def agg_description(df):
    ...

def agg_date_project(df):
    ...
    agg_description(...)
    ...

# Selecting several columns from a groupby requires a list (double brackets).
df.groupby(by=["date", "project"])[["duration", "description"]].apply(agg_date_project)
```

But I can't figure it out. A complicating factor is that the aggregation for the `description` column depends on the `duration` column as well.

I could do it "manually" (e.g. using loops), but if possible I'd like to do it using `groupby()` as well.
Answer 1

Score: 2

This one is a bit tricky; you will need a few intermediate steps.

First, compute the sum of the durations for each date, project and description:

```python
sum_df = df.groupby(by=["date", "project", "description"], as_index=False)["duration"].sum()
```

Then convert the duration to hours:

```python
sum_df["duration_hours"] = sum_df["duration"].apply(lambda x: x.total_seconds() / 60 / 60)
```

Now format a string that contains the description and the time:

```python
sum_df["description_time"] = sum_df.apply(lambda x: f"{x['description']} ({x['duration_hours']})", axis=1)
```

Next, group by date and project only, and join the strings to get the final description:

```python
sum_df["final_description"] = sum_df.groupby(["date", "project"])["description_time"].transform(lambda x: ', '.join(x))
```

Finally, group again, summing the durations and keeping `final_description` (I keep only the first value, since all values within a group are identical):

```python
sum_df.groupby(by=["date", "project"], as_index=False)[["duration", "final_description"]].agg({"duration": 'sum', "final_description": 'first'})
```

There you go!

(Note that the formatting is not exactly what you have in your expected result: groups with only one type of description still show the time in parentheses. It shouldn't be too hard to adjust the result if you really need to.)
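The formatting adjustment mentioned in the note could be sketched as follows. This is only one possible approach, not the answerer's code: the names `per_desc`, `n_parts` and `label` are made up for illustration, `sort=False` is an assumption used to keep descriptions in order of first appearance, and the `g` format specifier is chosen so that whole hours print as `3` rather than `3.0`.

```python
import pandas as pd
from datetime import date, timedelta

df = pd.DataFrame(
    (
        (date(2023, 2, 27), timedelta(hours=0.5), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1.5), "project A", "execution"),
        (date(2023, 2, 27), timedelta(hours=0.25), "project B", "planning"),
        (date(2023, 2, 28), timedelta(hours=3), "project A", "wrapup"),
        (date(2023, 2, 28), timedelta(hours=3), "project B", "execution"),
        (date(2023, 2, 28), timedelta(hours=2), "project B", "miscellaneous"),
    ),
    columns=("date", "duration", "project", "description"),
)

# Sum per date/project/description; sort=False keeps first-appearance order.
per_desc = df.groupby(
    ["date", "project", "description"], sort=False, as_index=False
)["duration"].sum()

# Hours as compact strings: 1.5 -> "1.5", 3.0 -> "3".
hours = (per_desc["duration"].dt.total_seconds() / 3600).map("{:g}".format)

# Number of distinct descriptions within each date/project group.
n_parts = per_desc.groupby(["date", "project"])["description"].transform("count")

# Append the hours only where a group has more than one description.
label = per_desc["description"] + " (" + hours + ")"
per_desc["description"] = label.where(n_parts > 1, per_desc["description"])

result = per_desc.groupby(["date", "project"], as_index=False).agg(
    duration=("duration", "sum"),
    description=("description", ", ".join),
)
print(result)
```

The second `groupby` sorts the groups by key but keeps the row order inside each group, so the joined descriptions follow the order in which they first appear in `df`.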
Answer 2

Score: 2

You can do that in one go, without any use of `.apply`:

```python
result = (
    df.groupby(["date", "project", "description"], as_index=False).sum()
    .assign(description=lambda df:
        df["description"] + " ("
        + (df["duration"].dt.total_seconds() / 3_600).astype("str") + ")"
    )
    .groupby(["date", "project"], as_index=False).agg({
        "duration": "sum", "description": ", ".join
    })
)
```

- First calculate the sums for each `date`-`project`-`description` group.
- Then augment the `description` column with the respective durations.
- Finally aggregate over the `date`-`project` groups: summing the `duration`s and `", ".join`-ing the `description`s.

Result:

```
         date    project        duration                           description
0  2023-02-27  project A 0 days 03:00:00       execution (1.5), planning (1.5)
1  2023-02-27  project B 0 days 00:15:00                       planning (0.25)
2  2023-02-28  project A 0 days 03:00:00                          wrapup (3.0)
3  2023-02-28  project B 0 days 05:00:00  execution (3.0), miscellaneous (2.0)
```

If you don't want the parts aggregated into a single column, you could do this instead:

```python
result = (
    df.pivot_table(
        values="duration", index=["date", "project"], columns="description",
        aggfunc="sum", fill_value=pd.Timedelta(0)
    )
    .assign(duration=lambda df: df.sum(axis=1))
    .reset_index()
)
```

Result:

```
description        date    project       execution   miscellaneous  \
0            2023-02-27  project A 0 days 01:30:00 0 days 00:00:00
1            2023-02-27  project B 0 days 00:00:00 0 days 00:00:00
2            2023-02-28  project A 0 days 00:00:00 0 days 00:00:00
3            2023-02-28  project B 0 days 03:00:00 0 days 02:00:00

description        planning          wrapup        duration
0           0 days 01:30:00 0 days 00:00:00 0 days 03:00:00
1           0 days 00:15:00 0 days 00:00:00 0 days 00:15:00
2           0 days 00:00:00 0 days 03:00:00 0 days 03:00:00
3           0 days 00:00:00 0 days 00:00:00 0 days 05:00:00
```
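For comparison with the `.apply`-free chain, the two-level `apply` idea from the question might look roughly like the sketch below. The function names `agg_description` and `agg_date_project` come from the question's placeholders, but their bodies are an assumption about what the asker intended, not code from the thread.

```python
import pandas as pd
from datetime import date, timedelta

df = pd.DataFrame(
    (
        (date(2023, 2, 27), timedelta(hours=0.5), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1.5), "project A", "execution"),
        (date(2023, 2, 27), timedelta(hours=0.25), "project B", "planning"),
        (date(2023, 2, 28), timedelta(hours=3), "project A", "wrapup"),
        (date(2023, 2, 28), timedelta(hours=3), "project B", "execution"),
        (date(2023, 2, 28), timedelta(hours=2), "project B", "miscellaneous"),
    ),
    columns=("date", "duration", "project", "description"),
)


def agg_description(group: pd.DataFrame) -> str:
    """Inner level: total hours per description within one date/project group."""
    per_desc = group.groupby("description", sort=False)["duration"].sum()
    if len(per_desc) == 1:
        # A single description needs no breakdown.
        return per_desc.index[0]
    hours = per_desc.dt.total_seconds() / 3600
    return ", ".join(f"{name} ({h:g})" for name, h in hours.items())


def agg_date_project(group: pd.DataFrame) -> pd.Series:
    """Outer level: one output row per date/project group."""
    return pd.Series(
        {
            "duration": group["duration"].sum(),
            "description": agg_description(group),
        }
    )


result = (
    df.groupby(["date", "project"])[["duration", "description"]]
    .apply(agg_date_project)
    .reset_index()
)
print(result)
```

Selecting `[["duration", "description"]]` before `apply` keeps the grouping columns out of the frames passed to the function. This version is easier to read than a chained expression, though a Python-level function is called once per group, so it will likely be slower on large frames.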
Answer 3

Score: 0

Why do you want to have every description's duration in a single column? Here is how I would do it:

```python
# %%
# One column per description; rows with a different description become NaT.
df['planning'] = df[df['description'] == 'planning']['duration']
df['execution'] = df[df['description'] == 'execution']['duration']
df['wrapup'] = df[df['description'] == 'wrapup']['duration']
df['miscellaneous'] = df[df['description'] == 'miscellaneous']['duration']
df = df.fillna(timedelta(hours=0))
df
# %%
proj_duration = df.groupby(by=["date", "project"])["duration"].sum().to_frame().reset_index()
proj_duration
# %%
description_duration = df.groupby(by=["date", "project"])[['planning', 'execution', 'wrapup', 'miscellaneous']].sum().reset_index()
description_duration
# %%
final = proj_duration.merge(description_duration, on=['date', 'project'])
final
# %%
```

This works well if you have only a few descriptions; otherwise you can build a list of the descriptions and create the columns in a loop.
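The loop-based variant mentioned above could be sketched like this. It is an illustration rather than the answerer's code: the names `wide` and `descriptions` are hypothetical, and it assumes no description value collides with an existing column name such as `duration`.

```python
import pandas as pd
from datetime import date, timedelta

df = pd.DataFrame(
    (
        (date(2023, 2, 27), timedelta(hours=0.5), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1.5), "project A", "execution"),
        (date(2023, 2, 27), timedelta(hours=0.25), "project B", "planning"),
        (date(2023, 2, 28), timedelta(hours=3), "project A", "wrapup"),
        (date(2023, 2, 28), timedelta(hours=3), "project B", "execution"),
        (date(2023, 2, 28), timedelta(hours=2), "project B", "miscellaneous"),
    ),
    columns=("date", "duration", "project", "description"),
)

wide = df.copy()  # leave the original frame untouched

# Collect the descriptions from the data instead of hard-coding them.
descriptions = list(wide["description"].unique())

for name in descriptions:
    # Keep the duration where the description matches, zero elsewhere.
    wide[name] = wide["duration"].where(
        wide["description"] == name, pd.Timedelta(0)
    )

# A single groupby sums the total and the per-description columns together,
# so no separate merge step is needed.
final = wide.groupby(["date", "project"], as_index=False)[
    ["duration", *descriptions]
].sum()
print(final)
```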