Pandas 根据条件使用 .groupby 和 .mean()。

huangapple go评论97阅读模式
英文:

Pandas .groupby and .mean() based on conditions

问题

我有以下大型数据集,记录了数学竞赛结果,按日期降序排列的学生:例如,学生1在比赛1中获得第三名,而学生3赢得了比赛2,依此类推。

  1. Race_ID Date Student_ID Rank Studying_hours
  2. 1 1/1/2023 1 3 2
  3. 1 1/1/2023 2 2 5
  4. 1 1/1/2023 3 1 7
  5. 1 1/1/2023 4 4 1
  6. 2 11/9/2022 1 2 4
  7. 2 11/9/2022 2 3 2
  8. 2 11/9/2022 3 1 8
  9. 3 17/4/2022 5 4 3
  10. 3 17/4/2022 2 1 7
  11. 3 17/4/2022 3 2 2
  12. 3 17/4/2022 4 3 3
  13. 4 1/3/2022 1 3 7
  14. 4 1/3/2022 2 2 2
  15. 5 1/1/2021 1 2 2
  16. 5 1/1/2021 2 3 3
  17. 5 1/1/2021 3 1 6

我想生成一个名为"winning_past_studying_hours"的新列,其中包含他以前的比赛中获得第1或第2名的平均学习小时数。

例如,对于学生1:

  1. Race_ID Date Student_ID Rank Studying_hours
  2. 1 1/1/2023 1 3 2
  3. 2 11/9/2022 1 2 4
  4. 4 1/3/2022 1 3 7
  5. 5 1/1/2021 1 2 2

该列如下所示:

  1. Race_ID Date Student_ID Rank Studying_hours winning_past_studying_hours
  2. 1 1/1/2023 1 3 2 (4+2)/2 = 3
  3. 2 11/9/2022 1 2 4 2/1 = 2
  4. 4 1/3/2022 1 3 7 2/1= 2
  5. 5 1/1/2021 1 2 2 NaN

同样,对于学生2:

  1. Race_ID Date Student_ID Rank Studying_hours
  2. 1 1/1/2023 2 2 5
  3. 2 11/9/2022 2 3 2
  4. 3 17/4/2022 2 1 7
  5. 4 1/3/2022 2 2 2
  6. 5 1/1/2021 2 3 3

该列如下所示:

  1. Race_ID Date Student_ID Rank Studying_hours winning_past_studying_hours
  2. 1 1/1/2023 2 2 5 (7+2)/2=4.5
  3. 2 11/9/2022 2 3 2 (7+2)/2=4.5
  4. 3 17/4/2022 2 1 7 2/1=2
  5. 4 1/3/2022 2 2 2 NaN
  6. 5 1/1/2021 2 3 3 NaN

我知道基本的groupbymean函数,但我不知道如何在groupby函数中包含条件Rank.isin([1,2])。非常感谢。

英文:

I have the following large dataset recording the result of a math competition among students in descending order of date: So for example, student 1 comes third in Race 1 while student 3 won Race 2, etc.

  1. Race_ID Date Student_ID Rank Studying_hours
  2. 1 1/1/2023 1 3 2
  3. 1 1/1/2023 2 2 5
  4. 1 1/1/2023 3 1 7
  5. 1 1/1/2023 4 4 1
  6. 2 11/9/2022 1 2 4
  7. 2 11/9/2022 2 3 2
  8. 2 11/9/2022 3 1 8
  9. 3 17/4/2022 5 4 3
  10. 3 17/4/2022 2 1 7
  11. 3 17/4/2022 3 2 2
  12. 3 17/4/2022 4 3 3
  13. 4 1/3/2022 1 3 7
  14. 4 1/3/2022 2 2 2
  15. 5 1/1/2021 1 2 2
  16. 5 1/1/2021 2 3 3
  17. 5 1/1/2021 3 1 6

and I want to generate a new column called "winning_past_studying_hours" which is the average studying hours of his past competitions and where he ended up with Rank 1 or 2.

So for example, for student 1:

  1. Race_ID Date Student_ID Rank Studying_hours
  2. 1 1/1/2023 1 3 2
  3. 2 11/9/2022 1 2 4
  4. 4 1/3/2022 1 3 7
  5. 5 1/1/2021 1 2 2

the column looks like

  1. Race_ID Date Student_ID Rank Studying_hours winning_past_studying_hours
  2. 1 1/1/2023 1 3 2 (4+2)/2 = 3
  3. 2 11/9/2022 1 2 4 2/1 = 2
  4. 4 1/3/2022 1 3 7 2/1= 2
  5. 5 1/1/2021 1 2 2 NaN

Similarly, for student 2:

  1. Race_ID Date Student_ID Rank Studying_hours
  2. 1 1/1/2023 2 2 5
  3. 2 11/9/2022 2 3 2
  4. 3 17/4/2022 2 1 7
  5. 4 1/3/2022 2 2 2
  6. 5 1/1/2021 2 3 3

The column looks like

  1. Race_ID Date Student_ID Rank Studying_hours winning_past_studying_hours
  2. 1 1/1/2023 2 2 5 (7+2)/2=4.5
  3. 2 11/9/2022 2 3 2 (7+2)/2=4.5
  4. 3 17/4/2022 2 1 7 2/1=2
  5. 4 1/3/2022 2 2 2 NaN
  6. 5 1/1/2021 2 3 3 NaN

I know the basic groupby and mean function but I do not know how to include the condition Rank.isin([1,2]) in the groupby function. Thank you so much.

EDIT: Desired output:

  1. Race_ID Date Student_ID Rank Studying_hours winning_past_studying_hours
  2. 1 1/1/2023 1 3 2 3
  3. 1 1/1/2023 2 2 5 4.5
  4. 1 1/1/2023 3 1 7 5.333
  5. 1 1/1/2023 4 4 1 NaN
  6. 2 11/9/2022 1 2 4 2
  7. 2 11/9/2022 2 3 2 4.5
  8. 2 11/9/2022 3 1 8 4
  9. 3 17/4/2022 5 4 3 NaN
  10. 3 17/4/2022 2 1 7 2
  11. 3 17/4/2022 3 2 2 6
  12. 3 17/4/2022 4 3 3 NaN
  13. 4 1/3/2022 1 3 7 2
  14. 4 1/3/2022 2 2 2 NaN
  15. 5 1/1/2021 1 2 2 NaN
  16. 5 1/1/2021 2 3 3 NaN
  17. 5 1/1/2021 3 1 6 NaN

答案1

得分: 4

我们用 np.NaN 替换了学生没有“赢得”比赛的每个比赛的学习小时数,对平均值的计算没有影响。

使用一个大数作为窗口函数 rolling,获得一个扩展窗口,并通过指定 closed='left' 计算过去的运行均值,该参数会丢弃最近的条目。

然后我们重新连接。

  1. large_number=100000
  2. df = pd.DataFrame(data)
  3. df['Date']=pd.to_datetime(df['Date'])
  4. df['Studying_hours']=((df.Rank<3)*df.Studying_hours).replace({0:np.NaN}) # 这比使用 lambda 的 apply 更高效
  5. winning=df.sort_values('Date').groupby('Student_ID')['Studying_hours'].rolling(large_number,closed='left',min_periods=1).mean()
  6. df['past_winning_hours_mean']=winning.reset_index(level=0, drop=True)

测试:

  1. >>> df.sort_values(['Date', 'Student_ID'])

输出:

  1. Race_ID Date Student_ID Rank Studying_hours past_winning_hours_mean
  2. 13 5 2021-01-01 1 2 2.0 NaN
  3. 14 5 2021-01-01 2 3 NaN NaN
  4. 15 5 2021-01-01 3 1 6.0 NaN
  5. 11 4 2022-01-03 1 3 NaN 2.000000
  6. 12 4 2022-01-03 2 2 2.0 NaN
  7. 8 3 2022-04-17 2 1 7.0 2.000000
  8. 9 3 2022-04-17 3 2 2.0 6.000000
  9. 10 3 2022-04-17 4 3 NaN NaN
  10. 7 3 2022-04-17 5 4 NaN NaN
  11. 4 2 2022-11-09 1 2 4.0 2.000000
  12. 5 2 2022-11-09 2 3 NaN 4.500000
  13. 6 2 2022-11-09 3 1 8.0 4.000000
  14. 0 1 2023-01-01 1 3 NaN 3.000000
  15. 1 1 2023-01-01 2 2 5.0 4.500000
  16. 2 1 2023-01-01 3 1 7.0 5.333333
  17. 3 1 2023-01-01 4 4 NaN NaN

我在一个包含 30000 行的数据集上对这段代码进行了性能测试:

  1. 6.07 毫秒 ± 45.2 微秒 每次循环(平均值 ± 7 次运行的标准差,每次循环 100 次)
英文:

We replace the studying hours for every competition a student didn't "win" with np.NaN which has no impact on the calculation of the mean.

Use a window function rolling with a large number to get an expanding window over the entries and compute the past running mean by specifying closed=&#39;left&#39; which discards the most recent entry.

Then we join back.

  1. large_number=100000
  2. df = pd.DataFrame(data)
  3. df[&#39;Date&#39;]=pd.to_datetime(df[&#39;Date&#39;])
  4. df[&#39;Studying_hours&#39;]=((df.Rank&lt;3)*df.Studying_hours).replace({0:np.NaN}) # This is more performant than an apply with a lambda
  5. winning=df.sort_values(&#39;Date&#39;).groupby(&#39;Student_ID&#39;)[&#39;Studying_hours&#39;].rolling(large_number,closed=&#39;left&#39;,min_periods=1).mean()
  6. df[&#39;past_winning_hours_mean&#39;]=winning.reset_index(level=0, drop=True)

Test:

  1. &gt;&gt;&gt; df.sort_values([&#39;Date&#39;, &#39;Student_ID&#39;])

Output:

  1. Race_ID Date Student_ID Rank Studying_hours past_winning_hours_mean
  2. 13 5 2021-01-01 1 2 2.0 NaN
  3. 14 5 2021-01-01 2 3 NaN NaN
  4. 15 5 2021-01-01 3 1 6.0 NaN
  5. 11 4 2022-01-03 1 3 NaN 2.000000
  6. 12 4 2022-01-03 2 2 2.0 NaN
  7. 8 3 2022-04-17 2 1 7.0 2.000000
  8. 9 3 2022-04-17 3 2 2.0 6.000000
  9. 10 3 2022-04-17 4 3 NaN NaN
  10. 7 3 2022-04-17 5 4 NaN NaN
  11. 4 2 2022-11-09 1 2 4.0 2.000000
  12. 5 2 2022-11-09 2 3 NaN 4.500000
  13. 6 2 2022-11-09 3 1 8.0 4.000000
  14. 0 1 2023-01-01 1 3 NaN 3.000000
  15. 1 1 2023-01-01 2 2 5.0 4.500000
  16. 2 1 2023-01-01 3 1 7.0 5.333333
  17. 3 1 2023-01-01 4 4 NaN NaN

I profiled this code on a dataset with 30000 rows:

  1. 6.07 ms &#177; 45.2 &#181;s per loop (mean &#177; std. dev. of 7 runs, 100 loops each)

huangapple
  • 本文由 发表于 2023年6月29日 16:12:44
  • 转载请务必保留本文链接:https://go.coder-hub.com/76579195.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定