Pandas 根据条件使用 .groupby 和 .mean()。

huangapple go评论72阅读模式
英文:

Pandas .groupby and .mean() based on conditions

问题

我有以下大型数据集,记录了数学竞赛结果,按日期降序排列的学生:例如,学生1在比赛1中获得第三名,而学生3赢得了比赛2,依此类推。

Race_ID   Date           Student_ID      Rank    Studying_hours
1         1/1/2023       1               3       2
1         1/1/2023       2               2       5
1         1/1/2023       3               1       7
1         1/1/2023       4               4       1
2         11/9/2022      1               2       4
2         11/9/2022      2               3       2
2         11/9/2022      3               1       8
3         17/4/2022      5               4       3
3         17/4/2022      2               1       7
3         17/4/2022      3               2       2
3         17/4/2022      4               3       3
4         1/3/2022       1               3       7
4         1/3/2022       2               2       2 
5         1/1/2021       1               2       2
5         1/1/2021       2               3       3
5         1/1/2021       3               1       6

我想生成一个名为"winning_past_studying_hours"的新列,其中包含他以前的比赛中获得第1或第2名的平均学习小时数。

例如,对于学生1:

Race_ID	Date	Student_ID	Rank	Studying_hours
1	1/1/2023	1	        3	    2
2	11/9/2022	1        	2	    4
4	1/3/2022	1        	3    	7
5	1/1/2021	1        	2    	2

该列如下所示:

Race_ID	Date	Student_ID	Rank	Studying_hours    winning_past_studying_hours
1	1/1/2023	1	        3	    2                 (4+2)/2 = 3
2	11/9/2022	1        	2	    4                 2/1 = 2
4	1/3/2022	1        	3    	7                 2/1= 2
5	1/1/2021	1        	2    	2                 NaN

同样,对于学生2:

Race_ID	Date	Student_ID	Rank	Studying_hours
1	1/1/2023	2        	2	    5
2	11/9/2022	2     		3	    2
3	17/4/2022	2     		1	    7
4	1/3/2022	2     		2	    2
5	1/1/2021	2     		3	    3

该列如下所示:

Race_ID	Date	Student_ID	Rank	Studying_hours    winning_past_studying_hours
1	1/1/2023	2        	2	    5                 (7+2)/2=4.5
2	11/9/2022	2     		3	    2                 (7+2)/2=4.5
3	17/4/2022	2     		1	    7                 2/1=2
4	1/3/2022	2     		2	    2                 NaN
5	1/1/2021	2     		3	    3                 NaN

我知道基本的groupbymean函数,但我不知道如何在groupby函数中包含条件Rank.isin([1,2])。非常感谢。

英文:

I have the following large dataset recording the result of a math competition among students in descending order of date: So for example, student 1 comes third in Race 1 while student 3 won Race 2, etc.

Race_ID   Date           Student_ID      Rank    Studying_hours
1         1/1/2023       1               3       2
1         1/1/2023       2               2       5
1         1/1/2023       3               1       7
1         1/1/2023       4               4       1
2         11/9/2022      1               2       4
2         11/9/2022      2               3       2
2         11/9/2022      3               1       8
3         17/4/2022      5               4       3
3         17/4/2022      2               1       7
3         17/4/2022      3               2       2
3         17/4/2022      4               3       3
4         1/3/2022       1               3       7
4         1/3/2022       2               2       2 
5         1/1/2021       1               2       2
5         1/1/2021       2               3       3
5         1/1/2021       3               1       6

and I want to generate a new column called "winning_past_studying_hours" which is the average studying hours of his past competitions and where he ended up with Rank 1 or 2.

So for example, for student 1:

Race_ID	Date	Student_ID	Rank	Studying_hours
1	1/1/2023	1	        3	    2
2	11/9/2022	1        	2	    4
4	1/3/2022	1        	3    	7
5	1/1/2021	1        	2    	2

the column looks like

Race_ID	Date	Student_ID	Rank	Studying_hours    winning_past_studying_hours
1	1/1/2023	1	        3	    2                 (4+2)/2 = 3
2	11/9/2022	1        	2	    4                 2/1 = 2
4	1/3/2022	1        	3    	7                 2/1= 2
5	1/1/2021	1        	2    	2                 NaN

Similarly, for student 2:

Race_ID	Date	Student_ID	Rank	Studying_hours
1	1/1/2023	2        	2	    5
2	11/9/2022	2     		3	    2
3	17/4/2022	2     		1	    7
4	1/3/2022	2     		2	    2
5	1/1/2021	2     		3	    3

The column looks like

Race_ID	Date	Student_ID	Rank	Studying_hours    winning_past_studying_hours
1	1/1/2023	2        	2	    5                 (7+2)/2=4.5
2	11/9/2022	2     		3	    2                 (7+2)/2=4.5
3	17/4/2022	2     		1	    7                 2/1=2
4	1/3/2022	2     		2	    2                 NaN
5	1/1/2021	2     		3	    3                 NaN

I know the basic groupby and mean function but I do not know how to include the condition Rank.isin([1,2]) in the groupby function. Thank you so much.

EDIT: Desired output:

Race_ID   Date           Student_ID      Rank    Studying_hours  winning_past_studying_hours
1         1/1/2023       1               3       2               3
1         1/1/2023       2               2       5               4.5
1         1/1/2023       3               1       7               5.333
1         1/1/2023       4               4       1               NaN
2         11/9/2022      1               2       4               2
2         11/9/2022      2               3       2               4.5
2         11/9/2022      3               1       8               4
3         17/4/2022      5               4       3               NaN
3         17/4/2022      2               1       7               2
3         17/4/2022      3               2       2               6
3         17/4/2022      4               3       3               NaN
4         1/3/2022       1               3       7               2
4         1/3/2022       2               2       2               NaN
5         1/1/2021       1               2       2               NaN
5         1/1/2021       2               3       3               NaN
5         1/1/2021       3               1       6               NaN

答案1

得分: 4

我们用 np.NaN 替换了学生没有“赢得”比赛的每个比赛的学习小时数,对平均值的计算没有影响。

使用一个大数作为窗口函数 rolling,获得一个扩展窗口,并通过指定 closed='left' 计算过去的运行均值,该参数会丢弃最近的条目。

然后我们重新连接。

large_number=100000

df = pd.DataFrame(data)
df['Date']=pd.to_datetime(df['Date'])
df['Studying_hours']=((df.Rank<3)*df.Studying_hours).replace({0:np.NaN}) # 这比使用 lambda 的 apply 更高效
winning=df.sort_values('Date').groupby('Student_ID')['Studying_hours'].rolling(large_number,closed='left',min_periods=1).mean()
df['past_winning_hours_mean']=winning.reset_index(level=0, drop=True)

测试:

>>> df.sort_values(['Date', 'Student_ID'])

输出:

Race_ID Date       Student_ID Rank Studying_hours past_winning_hours_mean
13       5  2021-01-01  1     2               2.0                         NaN
14       5  2021-01-01  2     3               NaN                         NaN
15       5  2021-01-01  3     1               6.0                         NaN
11       4  2022-01-03  1     3               NaN                         2.000000
12       4  2022-01-03  2     2               2.0                         NaN
8        3  2022-04-17  2     1               7.0                         2.000000
9        3  2022-04-17  3     2               2.0                         6.000000
10       3  2022-04-17  4     3               NaN                         NaN
7        3  2022-04-17  5     4               NaN                         NaN
4        2  2022-11-09  1     2               4.0                         2.000000
5        2  2022-11-09  2     3               NaN                         4.500000
6        2  2022-11-09  3     1               8.0                         4.000000
0        1  2023-01-01  1     3               NaN                         3.000000
1        1  2023-01-01  2     2               5.0                         4.500000
2        1  2023-01-01  3     1               7.0                         5.333333
3        1  2023-01-01  4     4               NaN                         NaN

我在一个包含 30000 行的数据集上对这段代码进行了性能测试:

6.07 毫秒 ± 45.2 微秒 每次循环(平均值 ± 7 次运行的标准差,每次循环 100 次)
英文:

We replace the studying hours for every competition a student didn't "win" with np.NaN which has no impact on the calculation of the mean.

Use a window function rolling with a large number to get an expanding window over the entries and compute the past running mean by specifying closed=&#39;left&#39; which discards the most recent entry.

Then we join back.



large_number=100000

df = pd.DataFrame(data)
df[&#39;Date&#39;]=pd.to_datetime(df[&#39;Date&#39;])
df[&#39;Studying_hours&#39;]=((df.Rank&lt;3)*df.Studying_hours).replace({0:np.NaN}) # This is more performant than an apply with a lambda
winning=df.sort_values(&#39;Date&#39;).groupby(&#39;Student_ID&#39;)[&#39;Studying_hours&#39;].rolling(large_number,closed=&#39;left&#39;,min_periods=1).mean()
df[&#39;past_winning_hours_mean&#39;]=winning.reset_index(level=0, drop=True)

Test:

&gt;&gt;&gt; df.sort_values([&#39;Date&#39;, &#39;Student_ID&#39;])

Output:

Race_ID	Date	Student_ID	Rank	Studying_hours	past_winning_hours_mean
13	5	2021-01-01	1	2	2.0	NaN
14	5	2021-01-01	2	3	NaN	NaN
15	5	2021-01-01	3	1	6.0	NaN
11	4	2022-01-03	1	3	NaN	2.000000
12	4	2022-01-03	2	2	2.0	NaN
8	3	2022-04-17	2	1	7.0	2.000000
9	3	2022-04-17	3	2	2.0	6.000000
10	3	2022-04-17	4	3	NaN	NaN
7	3	2022-04-17	5	4	NaN	NaN
4	2	2022-11-09	1	2	4.0	2.000000
5	2	2022-11-09	2	3	NaN	4.500000
6	2	2022-11-09	3	1	8.0	4.000000
0	1	2023-01-01	1	3	NaN	3.000000
1	1	2023-01-01	2	2	5.0	4.500000
2	1	2023-01-01	3	1	7.0	5.333333
3	1	2023-01-01	4	4	NaN	NaN

I profiled this code on a dataset with 30000 rows:

6.07 ms &#177; 45.2 &#181;s per loop (mean &#177; std. dev. of 7 runs, 100 loops each)

huangapple
  • 本文由 发表于 2023年6月29日 16:12:44
  • 转载请务必保留本文链接:https://go.coder-hub.com/76579195.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定