Pandas .groupby and .mean() based on conditions
Question
I have the following large dataset recording the results of math competitions among students, sorted by date in descending order. For example, student 1 came third in Race 1, while student 3 won Race 2, and so on.
Race_ID Date Student_ID Rank Studying_hours
1 1/1/2023 1 3 2
1 1/1/2023 2 2 5
1 1/1/2023 3 1 7
1 1/1/2023 4 4 1
2 11/9/2022 1 2 4
2 11/9/2022 2 3 2
2 11/9/2022 3 1 8
3 17/4/2022 5 4 3
3 17/4/2022 2 1 7
3 17/4/2022 3 2 2
3 17/4/2022 4 3 3
4 1/3/2022 1 3 7
4 1/3/2022 2 2 2
5 1/1/2021 1 2 2
5 1/1/2021 2 3 3
5 1/1/2021 3 1 6
I want to generate a new column called "winning_past_studying_hours", which is the student's average studying hours over the past competitions in which they finished with Rank 1 or 2.
So for example, for student 1:
Race_ID Date Student_ID Rank Studying_hours
1 1/1/2023 1 3 2
2 11/9/2022 1 2 4
4 1/3/2022 1 3 7
5 1/1/2021 1 2 2
the column looks like this:
Race_ID Date Student_ID Rank Studying_hours winning_past_studying_hours
1 1/1/2023 1 3 2 (4+2)/2 = 3
2 11/9/2022 1 2 4 2/1 = 2
4 1/3/2022 1 3 7 2/1= 2
5 1/1/2021 1 2 2 NaN
Similarly, for student 2:
Race_ID Date Student_ID Rank Studying_hours
1 1/1/2023 2 2 5
2 11/9/2022 2 3 2
3 17/4/2022 2 1 7
4 1/3/2022 2 2 2
5 1/1/2021 2 3 3
the column looks like this:
Race_ID Date Student_ID Rank Studying_hours winning_past_studying_hours
1 1/1/2023 2 2 5 (7+2)/2=4.5
2 11/9/2022 2 3 2 (7+2)/2=4.5
3 17/4/2022 2 1 7 2/1=2
4 1/3/2022 2 2 2 NaN
5 1/1/2021 2 3 3 NaN
I know the basic groupby and mean functions, but I do not know how to include the condition Rank.isin([1,2]) in the groupby call. Thank you so much.
EDIT: Desired output:
Race_ID Date Student_ID Rank Studying_hours winning_past_studying_hours
1 1/1/2023 1 3 2 3
1 1/1/2023 2 2 5 4.5
1 1/1/2023 3 1 7 5.333
1 1/1/2023 4 4 1 NaN
2 11/9/2022 1 2 4 2
2 11/9/2022 2 3 2 4.5
2 11/9/2022 3 1 8 4
3 17/4/2022 5 4 3 NaN
3 17/4/2022 2 1 7 2
3 17/4/2022 3 2 2 6
3 17/4/2022 4 3 3 NaN
4 1/3/2022 1 3 7 2
4 1/3/2022 2 2 2 NaN
5 1/1/2021 1 2 2 NaN
5 1/1/2021 2 3 3 NaN
5 1/1/2021 3 1 6 NaN
Answer 1
Score: 4
We replace the studying hours with NaN for every competition a student didn't "win" (that is, did not finish with Rank 1 or 2). NaN values are skipped by mean, so they have no impact on the calculation.
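As a small illustration of that masking step, here is a sketch using student 1's four rows from the question. The variable names are made up for the example, and `Series.where` is used as one possible way to apply the `Rank.isin([1, 2])` condition.

```python
import pandas as pd

# Student 1's rows from the question, most recent race first
hours = pd.Series([2, 4, 7, 2])
rank = pd.Series([3, 2, 3, 2])

# Keep the hours only where the rank is 1 or 2; every other row becomes NaN
winning_hours = hours.where(rank.isin([1, 2]))

mean_all = hours.mean()              # averages all four races
mean_winning = winning_hours.mean()  # NaN rows are skipped, so only 4 and 2 count
print(mean_all, mean_winning)
```

The masked mean is (4+2)/2, the same arithmetic as the first row of the student 1 example in the question.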
Use the window function rolling with a large window size to get what is effectively an expanding window over the entries, and compute the running mean of past entries by specifying closed='left', which excludes the current row from its own window.
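To see what closed='left' does, the following sketch runs the same window on a single student's masked hours, oldest race first. The comparison with `expanding().mean().shift()` is an addition for illustration, not part of the original answer; it is assumed here to be an equivalent way to express "mean of everything before this row".

```python
import numpy as np
import pandas as pd

# Student 1's masked hours, oldest race first (NaN = did not finish 1st or 2nd)
s = pd.Series([2.0, np.nan, 4.0, np.nan])

# Window far longer than the data; closed="left" leaves the current row out
past_mean = s.rolling(100, closed="left", min_periods=1).mean()

# Alternative spelling: expanding mean, shifted down by one row
alt = s.expanding().mean().shift()

print(pd.DataFrame({"hours": s, "rolling_left": past_mean, "expanding_shift": alt}))
```

The first row has no history, so both columns start with NaN; after that each value is the mean of the non-NaN entries strictly above it.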
Then we join the result back onto the original DataFrame.
import pandas as pd

large_number = 100000  # larger than any student's history, so the window never drops old rows
df = pd.DataFrame(data)  # data: the records from the question
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)  # dates are written day/month/year
# Keep the hours only for Rank 1 or 2 (vectorised, more performant than an apply with a lambda)
df['Studying_hours'] = df['Studying_hours'].where(df['Rank'] < 3)
winning = df.sort_values('Date').groupby('Student_ID')['Studying_hours'].rolling(large_number, closed='left', min_periods=1).mean()
df['past_winning_hours_mean'] = winning.reset_index(level=0, drop=True)  # drop the Student_ID level so the index aligns with df
Test:
>>> df.sort_values(['Date', 'Student_ID'])
Output:
Race_ID Date Student_ID Rank Studying_hours past_winning_hours_mean
13 5 2021-01-01 1 2 2.0 NaN
14 5 2021-01-01 2 3 NaN NaN
15 5 2021-01-01 3 1 6.0 NaN
11 4 2022-03-01 1 3 NaN 2.000000
12 4 2022-03-01 2 2 2.0 NaN
8 3 2022-04-17 2 1 7.0 2.000000
9 3 2022-04-17 3 2 2.0 6.000000
10 3 2022-04-17 4 3 NaN NaN
7 3 2022-04-17 5 4 NaN NaN
4 2 2022-09-11 1 2 4.0 2.000000
5 2 2022-09-11 2 3 NaN 4.500000
6 2 2022-09-11 3 1 8.0 4.000000
0 1 2023-01-01 1 3 NaN 3.000000
1 1 2023-01-01 2 2 5.0 4.500000
2 1 2023-01-01 3 1 7.0 5.333333
3 1 2023-01-01 4 4 NaN NaN
I profiled this code on a dataset with 30000 rows:
6.07 ms ± 45.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
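For anyone who wants to keep the Studying_hours column intact, as in the desired output of the question, a possible variation is sketched below. It is not the code that was profiled above: it masks into a temporary Series and uses `expanding().mean().shift()` inside `transform`, which is assumed to be slower than the rolling call on large data because of the Python lambda. The `data` dictionary is typed in from the question's table.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question, most recent race first
data = {
    "Race_ID": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5],
    "Date": ["1/1/2023"] * 4 + ["11/9/2022"] * 3 + ["17/4/2022"] * 4
            + ["1/3/2022"] * 2 + ["1/1/2021"] * 3,
    "Student_ID": [1, 2, 3, 4, 1, 2, 3, 5, 2, 3, 4, 1, 2, 1, 2, 3],
    "Rank": [3, 2, 1, 4, 2, 3, 1, 4, 1, 2, 3, 3, 2, 2, 3, 1],
    "Studying_hours": [2, 5, 7, 1, 4, 2, 8, 3, 7, 2, 3, 7, 2, 2, 3, 6],
}
df = pd.DataFrame(data)
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)  # day/month/year

# Work on a date-ordered view (oldest first) without reordering df itself
ordered = df.sort_values("Date")

# Temporary Series: hours where Rank is 1 or 2, NaN elsewhere
winning_hours = ordered["Studying_hours"].where(ordered["Rank"].isin([1, 2]))

# Per student: mean of all earlier winning hours (shift excludes the current race)
past_mean = winning_hours.groupby(ordered["Student_ID"]).transform(
    lambda s: s.expanding().mean().shift()
)

# Assignment aligns on the index, so df keeps its original row order
df["winning_past_studying_hours"] = past_mean
print(df)
```

On the sample data this reproduces the column from the desired output in the question while leaving Studying_hours as originally recorded.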