2023年6月29日 16:12:44go评论168阅读模式

英文:

Pandas .groupby and .mean() based on conditions

问题

我有以下大型数据集，记录了数学竞赛结果，按日期降序排列的学生：例如，学生1在比赛1中获得第三名，而学生3赢得了比赛2，依此类推。

Race_ID   Date           Student_ID      Rank    Studying_hours
1         1/1/2023       1               3       2
1         1/1/2023       2               2       5
1         1/1/2023       3               1       7
1         1/1/2023       4               4       1
2         11/9/2022      1               2       4
2         11/9/2022      2               3       2
2         11/9/2022      3               1       8
3         17/4/2022      5               4       3
3         17/4/2022      2               1       7
3         17/4/2022      3               2       2
3         17/4/2022      4               3       3
4         1/3/2022       1               3       7
4         1/3/2022       2               2       2 
5         1/1/2021       1               2       2
5         1/1/2021       2               3       3
5         1/1/2021       3               1       6

我想生成一个名为"winning_past_studying_hours"的新列，其中包含他以前的比赛中获得第1或第2名的平均学习小时数。

例如，对于学生1：

Race_ID	Date	Student_ID	Rank	Studying_hours
1	1/1/2023	1	        3	    2
2	11/9/2022	1        	2	    4
4	1/3/2022	1        	3    	7
5	1/1/2021	1        	2    	2

该列如下所示：

Race_ID	Date	Student_ID	Rank	Studying_hours    winning_past_studying_hours
1	1/1/2023	1	        3	    2                 (4+2)/2 = 3
2	11/9/2022	1        	2	    4                 2/1 = 2
4	1/3/2022	1        	3    	7                 2/1= 2
5	1/1/2021	1        	2    	2                 NaN

同样，对于学生2：

Race_ID	Date	Student_ID	Rank	Studying_hours
1	1/1/2023	2        	2	    5
2	11/9/2022	2     		3	    2
3	17/4/2022	2     		1	    7
4	1/3/2022	2     		2	    2
5	1/1/2021	2     		3	    3

该列如下所示：

Race_ID	Date	Student_ID	Rank	Studying_hours    winning_past_studying_hours
1	1/1/2023	2        	2	    5                 (7+2)/2=4.5
2	11/9/2022	2     		3	    2                 (7+2)/2=4.5
3	17/4/2022	2     		1	    7                 2/1=2
4	1/3/2022	2     		2	    2                 NaN
5	1/1/2021	2     		3	    3                 NaN

我知道基本的groupby和mean函数，但我不知道如何在groupby函数中包含条件Rank.isin([1,2])。非常感谢。

英文:

I have the following large dataset recording the result of a math competition among students in descending order of date: So for example, student 1 comes third in Race 1 while student 3 won Race 2, etc.

Race_ID   Date           Student_ID      Rank    Studying_hours
1         1/1/2023       1               3       2
1         1/1/2023       2               2       5
1         1/1/2023       3               1       7
1         1/1/2023       4               4       1
2         11/9/2022      1               2       4
2         11/9/2022      2               3       2
2         11/9/2022      3               1       8
3         17/4/2022      5               4       3
3         17/4/2022      2               1       7
3         17/4/2022      3               2       2
3         17/4/2022      4               3       3
4         1/3/2022       1               3       7
4         1/3/2022       2               2       2 
5         1/1/2021       1               2       2
5         1/1/2021       2               3       3
5         1/1/2021       3               1       6

and I want to generate a new column called "winning_past_studying_hours" which is the average studying hours of his past competitions and where he ended up with Rank 1 or 2.

So for example, for student 1:

Race_ID	Date	Student_ID	Rank	Studying_hours
1	1/1/2023	1	        3	    2
2	11/9/2022	1        	2	    4
4	1/3/2022	1        	3    	7
5	1/1/2021	1        	2    	2

the column looks like

Race_ID	Date	Student_ID	Rank	Studying_hours    winning_past_studying_hours
1	1/1/2023	1	        3	    2                 (4+2)/2 = 3
2	11/9/2022	1        	2	    4                 2/1 = 2
4	1/3/2022	1        	3    	7                 2/1= 2
5	1/1/2021	1        	2    	2                 NaN

Similarly, for student 2:

Race_ID	Date	Student_ID	Rank	Studying_hours
1	1/1/2023	2        	2	    5
2	11/9/2022	2     		3	    2
3	17/4/2022	2     		1	    7
4	1/3/2022	2     		2	    2
5	1/1/2021	2     		3	    3

The column looks like

Race_ID	Date	Student_ID	Rank	Studying_hours    winning_past_studying_hours
1	1/1/2023	2        	2	    5                 (7+2)/2=4.5
2	11/9/2022	2     		3	    2                 (7+2)/2=4.5
3	17/4/2022	2     		1	    7                 2/1=2
4	1/3/2022	2     		2	    2                 NaN
5	1/1/2021	2     		3	    3                 NaN

I know the basic groupby and mean function but I do not know how to include the condition Rank.isin([1,2]) in the groupby function. Thank you so much.

EDIT: Desired output:

Race_ID   Date           Student_ID      Rank    Studying_hours  winning_past_studying_hours
1         1/1/2023       1               3       2               3
1         1/1/2023       2               2       5               4.5
1         1/1/2023       3               1       7               5.333
1         1/1/2023       4               4       1               NaN
2         11/9/2022      1               2       4               2
2         11/9/2022      2               3       2               4.5
2         11/9/2022      3               1       8               4
3         17/4/2022      5               4       3               NaN
3         17/4/2022      2               1       7               2
3         17/4/2022      3               2       2               6
3         17/4/2022      4               3       3               NaN
4         1/3/2022       1               3       7               2
4         1/3/2022       2               2       2               NaN
5         1/1/2021       1               2       2               NaN
5         1/1/2021       2               3       3               NaN
5         1/1/2021       3               1       6               NaN

答案1

得分: 4

我们用 np.NaN 替换了学生没有“赢得”比赛的每个比赛的学习小时数，对平均值的计算没有影响。

使用一个大数作为窗口函数 rolling，获得一个扩展窗口，并通过指定 closed='left' 计算过去的运行均值，该参数会丢弃最近的条目。

然后我们重新连接。

large_number=100000

df = pd.DataFrame(data)
df['Date']=pd.to_datetime(df['Date'])
df['Studying_hours']=((df.Rank<3)*df.Studying_hours).replace({0:np.NaN}) # 这比使用 lambda 的 apply 更高效
winning=df.sort_values('Date').groupby('Student_ID')['Studying_hours'].rolling(large_number,closed='left',min_periods=1).mean()
df['past_winning_hours_mean']=winning.reset_index(level=0, drop=True)

测试：

>>> df.sort_values(['Date', 'Student_ID'])

输出：

Race_ID Date       Student_ID Rank Studying_hours past_winning_hours_mean
13       5  2021-01-01  1     2               2.0                         NaN
14       5  2021-01-01  2     3               NaN                         NaN
15       5  2021-01-01  3     1               6.0                         NaN
11       4  2022-01-03  1     3               NaN                         2.000000
12       4  2022-01-03  2     2               2.0                         NaN
8        3  2022-04-17  2     1               7.0                         2.000000
9        3  2022-04-17  3     2               2.0                         6.000000
10       3  2022-04-17  4     3               NaN                         NaN
7        3  2022-04-17  5     4               NaN                         NaN
4        2  2022-11-09  1     2               4.0                         2.000000
5        2  2022-11-09  2     3               NaN                         4.500000
6        2  2022-11-09  3     1               8.0                         4.000000
0        1  2023-01-01  1     3               NaN                         3.000000
1        1  2023-01-01  2     2               5.0                         4.500000
2        1  2023-01-01  3     1               7.0                         5.333333
3        1  2023-01-01  4     4               NaN                         NaN

我在一个包含 30000 行的数据集上对这段代码进行了性能测试：

6.07 毫秒 ± 45.2 微秒 每次循环（平均值 ± 7 次运行的标准差，每次循环 100 次）

英文:

We replace the studying hours for every competition a student didn't "win" with np.NaN which has no impact on the calculation of the mean.

Use a window function rolling with a large number to get an expanding window over the entries and compute the past running mean by specifying closed='left' which discards the most recent entry.

Then we join back.



large_number=100000

df = pd.DataFrame(data)
df[&#39;Date&#39;]=pd.to_datetime(df[&#39;Date&#39;])
df[&#39;Studying_hours&#39;]=((df.Rank&lt;3)*df.Studying_hours).replace({0:np.NaN}) # This is more performant than an apply with a lambda
winning=df.sort_values(&#39;Date&#39;).groupby(&#39;Student_ID&#39;)[&#39;Studying_hours&#39;].rolling(large_number,closed=&#39;left&#39;,min_periods=1).mean()
df[&#39;past_winning_hours_mean&#39;]=winning.reset_index(level=0, drop=True)

Test:

&gt;&gt;&gt; df.sort_values([&#39;Date&#39;, &#39;Student_ID&#39;])

Output:

Race_ID	Date	Student_ID	Rank	Studying_hours	past_winning_hours_mean
13	5	2021-01-01	1	2	2.0	NaN
14	5	2021-01-01	2	3	NaN	NaN
15	5	2021-01-01	3	1	6.0	NaN
11	4	2022-01-03	1	3	NaN	2.000000
12	4	2022-01-03	2	2	2.0	NaN
8	3	2022-04-17	2	1	7.0	2.000000
9	3	2022-04-17	3	2	2.0	6.000000
10	3	2022-04-17	4	3	NaN	NaN
7	3	2022-04-17	5	4	NaN	NaN
4	2	2022-11-09	1	2	4.0	2.000000
5	2	2022-11-09	2	3	NaN	4.500000
6	2	2022-11-09	3	1	8.0	4.000000
0	1	2023-01-01	1	3	NaN	3.000000
1	1	2023-01-01	2	2	5.0	4.500000
2	1	2023-01-01	3	1	7.0	5.333333
3	1	2023-01-01	4	4	NaN	NaN

I profiled this code on a dataset with 30000 rows:

6.07 ms &#177; 45.2 &#181;s per loop (mean &#177; std. dev. of 7 runs, 100 loops each)

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Pandas 根据条件使用 .groupby 和 .mean()。

问题

答案1

递归函数返回一个列表

在Python中遇到了难以理解的参数、参数传递以及向函数传递变量的问题。

如何同时运行两个优化任务

在Azure Synapse中添加自定义Python库

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论