基于 Pandas 数据框中上次行的数值创建新列等于某些数字。

huangapple go评论95阅读模式
英文:

New column based on last time row value equals some numbers in Pandas dataframe

问题

我已经将您提供的内容翻译成中文,以下是代码部分的中文翻译:

  1. # 创建一个名为Recent_Predicted_Score的列,记录学生在前3名中的最近预测分数
  2. df.sort_values(by=['Student_ID', 'Date'], ascending=[True, False], inplace=True)
  3. lp1 = df['Predicted_Score'].where(df['Rank'].isin([1, 2, 3])).groupby(df['Student_ID']).bfill()
  4. lp2 = df.groupby(['Student_ID', 'Rank'])['Predicted_Score'].shift(-1)
  5. df = df.assign(Recent_Predicted_Score=lp1.mask(df['Rank'].isin([1, 2, 3]), lp2))

请注意,上述代码是用于创建"Recent_Predicted_Score"列的Python代码,用于记录学生在前三名中的最近预测分数。如果您有任何其他问题或需要进一步的帮助,请随时告诉我。

英文:

I have a dataframe sorted in descending order date that records the Rank of students in class and the predicted score.

  1. Date Student_ID Rank Predicted_Score
  2. 4/7/2021 33 2 87
  3. 13/6/2021 33 4 88
  4. 31/3/2021 33 7 88
  5. 28/2/2021 33 2 86
  6. 14/2/2021 33 10 86
  7. 31/1/2021 33 8 86
  8. 23/12/2020 33 1 81
  9. 8/11/2020 33 3 80
  10. 21/10/2020 33 3 80
  11. 23/9/2020 33 4 80
  12. 20/5/2020 33 3 80
  13. 29/4/2020 33 4 80
  14. 15/4/2020 33 2 79
  15. 26/2/2020 33 3 79
  16. 12/2/2020 33 5 79
  17. 29/1/2020 33 1 70

I want to create a column called Recent_Predicted_Score that record the last predicted_score where that student actually ranks top 3. So the desired outcome looks like

  1. Date Student_ID Rank Predicted_Score Recent_Predicted_Score
  2. 4/7/2021 33 2 87 86
  3. 13/6/2021 33 4 88 86
  4. 31/3/2021 33 7 88 86
  5. 28/2/2021 33 2 86 81
  6. 14/2/2021 33 10 86 81
  7. 31/1/2021 33 8 86 81
  8. 23/12/2020 33 1 81 80
  9. 8/11/2020 33 3 80 80
  10. 21/10/2020 33 3 80 80
  11. 23/9/2020 33 4 80 80
  12. 20/5/2020 33 3 80 79
  13. 29/4/2020 33 4 80 79
  14. 15/4/2020 33 2 79 79
  15. 26/2/2020 33 3 79 70
  16. 12/2/2020 33 5 79 70
  17. 29/1/2020 33 1 70

Here's what I have tried but it doesn't quite work, not sure if I am on the right track:

  1. df.sort_values(by = ['Student_ID', 'Date'], ascending = [True, False], inplace = True)
  2. lp1 = df['Predicted_Score'].where(df['Rank'].isin([1,2,3])).groupby(df['Student_ID']).bfill()
  3. lp2 = df.groupby(['Student_ID', 'Rank'])['Predicted_Score'].shift(-1)
  4. df = df.assign(Recent_Predicted_Score=lp1.mask(df['Rank'].isin([1,2,3]), lp2))

Thanks in advance.

答案1

得分: 2

尝试:

  1. df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
  2. df = df.sort_values(['Student_ID', 'Date'])
  3. df['Recent_Predicted_Score'] = np.where(df['Rank'].isin([1, 2, 3]), df['Predicted_Score'], np.nan)
  4. df['Recent_Predicted_Score'] = df.groupby('Student_ID', group_keys=False)['Recent_Predicted_Score'].apply(lambda x: x.ffill().shift().fillna(''))
  5. df = df.sort_values(['Student_ID', 'Date'], ascending=[True, False])
  6. print(df)

打印结果:

  1. Date Student_ID Rank Predicted_Score Recent_Predicted_Score
  2. 0 2021-07-04 33 2 87 86.0
  3. 1 2021-06-13 33 4 88 86.0
  4. 2 2021-03-31 33 7 88 86.0
  5. 3 2021-02-28 33 2 86 81.0
  6. 4 2021-02-14 33 10 86 81.0
  7. 5 2021-01-31 33 8 86 81.0
  8. 6 2020-12-23 33 1 81 80.0
  9. 7 2020-11-08 33 3 80 80.0
  10. 8 2020-10-21 33 3 80 80.0
  11. 9 2020-09-23 33 4 80 80.0
  12. 10 2020-05-20 33 3 80 79.0
  13. 11 2020-04-29 33 4 80 79.0
  14. 12 2020-04-15 33 2 79 79.0
  15. 13 2020-02-26 33 3 79 70.0
  16. 14 2020-02-12 33 5 79 70.0
  17. 15 2020-01-29 33 1 70
英文:

Try:

  1. df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
  2. df = df.sort_values(['Student_ID', 'Date'])
  3. df['Recent_Predicted_Score'] = np.where(df['Rank'].isin([1, 2, 3]), df['Predicted_Score'], np.nan)
  4. df['Recent_Predicted_Score'] = df.groupby('Student_ID', group_keys=False)['Recent_Predicted_Score'].apply(lambda x: x.ffill().shift().fillna(''))
  5. df = df.sort_values(['Student_ID', 'Date'], ascending = [True, False])
  6. print(df)

Prints:

  1. Date Student_ID Rank Predicted_Score Recent_Predicted_Score
  2. 0 2021-07-04 33 2 87 86.0
  3. 1 2021-06-13 33 4 88 86.0
  4. 2 2021-03-31 33 7 88 86.0
  5. 3 2021-02-28 33 2 86 81.0
  6. 4 2021-02-14 33 10 86 81.0
  7. 5 2021-01-31 33 8 86 81.0
  8. 6 2020-12-23 33 1 81 80.0
  9. 7 2020-11-08 33 3 80 80.0
  10. 8 2020-10-21 33 3 80 80.0
  11. 9 2020-09-23 33 4 80 80.0
  12. 10 2020-05-20 33 3 80 79.0
  13. 11 2020-04-29 33 4 80 79.0
  14. 12 2020-04-15 33 2 79 79.0
  15. 13 2020-02-26 33 3 79 70.0
  16. 14 2020-02-12 33 5 79 70.0
  17. 15 2020-01-29 33 1 70

答案2

得分: 1

让我们假设:

  • 可能存在多个唯一的 Student_ID
  • 行按照 OP 指定的降序 Date 排序,但不一定按 Student_ID 排序
  • 我们希望保留原始数据框的索引

在满足这些假设的情况下,以下是实现您问题要求的方法:

  1. df['Recent_Predicted_Score'] = df.loc[df['Rank'] <= 3, 'Predicted_Score']
  2. df['Recent_Predicted_Score'] = ( df
  3. .groupby('Student_ID', sort=False)
  4. .apply(lambda group: group['Predicted_Score'].shift(-1).bfill())
  5. ['Recent_Predicted_Score'] )

解释:

  • 创建一个新列 Recent_Predicted_Score,其中包含 Rank 在前三名的行的 Predicted_Score,其他行为 NaN
  • 使用 groupby() 根据 Student_ID 分组,将 sort 参数设置为 False 以提高性能(注意,groupby() 保留每个组内行的顺序,具体而言,不影响现有的按 Date 降序排序的顺序)
  • 在每个组内,使用 shift(-1)bfill() 来获得 Recent_Predicted_Score 的期望结果。

以下是示例输入(包含两个不同的 Student_ID 值)和输出,已根据 Student_ID, Date 进行排序以便更轻松检查:

  1. (示例输入和输出数据已省略)

希望这对您有所帮助。

英文:

Let's assume:

  • there may be more than one unique Student_ID
  • the rows are ordered by descending Date as indicated by OP, but may not be ordered by Student_ID
  • we want to preserve the index of the original dataframe

Subject to these assumptions, here's a way to do what your question asks:

  1. df[&#39;Recent_Predicted_Score&#39;] = df.loc[df.Rank &lt;= 3, &#39;Predicted_Score&#39;]
  2. df[&#39;Recent_Predicted_Score&#39;] = ( df
  3. .groupby(&#39;Student_ID&#39;, sort=False)
  4. .apply(lambda group: group.shift(-1).bfill())
  5. [&#39;Recent_Predicted_Score&#39;] )

Explanation:

  • create a new column Recent_Predicted_Score containing the PredictedScore where Rank is in the top 3 and NaN otherwise
  • use groupby() on Student_ID with the sort argument set to False for better performance (note that groupby() preserves the order of rows within each group, specifically, not influencing the existing descending order by Date)
  • within each group, do shift(-1) and bfill() to get the desired result for Recent_Predicted_Score.

Sample input (with two distinct Student_ID values):

  1. Date Student_ID Rank Predicted_Score
  2. 0 2021-07-04 33 2 87
  3. 1 2021-07-04 66 2 87
  4. 2 2021-06-13 33 4 88
  5. 3 2021-06-13 66 4 88
  6. 4 2021-03-31 33 7 88
  7. 5 2021-03-31 66 7 88
  8. 6 2021-02-28 33 2 86
  9. 7 2021-02-28 66 2 86
  10. 8 2021-02-14 33 10 86
  11. 9 2021-02-14 66 10 86
  12. 10 2021-01-31 33 8 86
  13. 11 2021-01-31 66 8 86
  14. 12 2020-12-23 33 1 81
  15. 13 2020-12-23 66 1 81
  16. 14 2020-11-08 33 3 80
  17. 15 2020-11-08 66 3 80
  18. 16 2020-10-21 33 3 80
  19. 17 2020-10-21 66 3 80
  20. 18 2020-09-23 33 4 80
  21. 19 2020-09-23 66 4 80
  22. 20 2020-05-20 33 3 80
  23. 21 2020-05-20 66 3 80
  24. 22 2020-04-29 33 4 80
  25. 23 2020-04-29 66 4 80
  26. 24 2020-04-15 33 2 79
  27. 25 2020-04-15 66 2 79
  28. 26 2020-02-26 33 3 79
  29. 27 2020-02-26 66 3 79
  30. 28 2020-02-12 33 5 79
  31. 29 2020-02-12 66 5 79
  32. 30 2020-01-29 33 1 70
  33. 31 2020-01-29 66 1 70

Output:

  1. Date Student_ID Rank Predicted_Score Recent_Predicted_Score
  2. 0 2021-07-04 33 2 87 86.0
  3. 1 2021-07-04 66 2 87 86.0
  4. 2 2021-06-13 33 4 88 86.0
  5. 3 2021-06-13 66 4 88 86.0
  6. 4 2021-03-31 33 7 88 86.0
  7. 5 2021-03-31 66 7 88 86.0
  8. 6 2021-02-28 33 2 86 81.0
  9. 7 2021-02-28 66 2 86 81.0
  10. 8 2021-02-14 33 10 86 81.0
  11. 9 2021-02-14 66 10 86 81.0
  12. 10 2021-01-31 33 8 86 81.0
  13. 11 2021-01-31 66 8 86 81.0
  14. 12 2020-12-23 33 1 81 80.0
  15. 13 2020-12-23 66 1 81 80.0
  16. 14 2020-11-08 33 3 80 80.0
  17. 15 2020-11-08 66 3 80 80.0
  18. 16 2020-10-21 33 3 80 80.0
  19. 17 2020-10-21 66 3 80 80.0
  20. 18 2020-09-23 33 4 80 80.0
  21. 19 2020-09-23 66 4 80 80.0
  22. 20 2020-05-20 33 3 80 79.0
  23. 21 2020-05-20 66 3 80 79.0
  24. 22 2020-04-29 33 4 80 79.0
  25. 23 2020-04-29 66 4 80 79.0
  26. 24 2020-04-15 33 2 79 79.0
  27. 25 2020-04-15 66 2 79 79.0
  28. 26 2020-02-26 33 3 79 70.0
  29. 27 2020-02-26 66 3 79 70.0
  30. 28 2020-02-12 33 5 79 70.0
  31. 29 2020-02-12 66 5 79 70.0
  32. 30 2020-01-29 33 1 70 NaN
  33. 31 2020-01-29 66 1 70 NaN

Output sorted by Student_ID, Date for easier inspection:

  1. Date Student_ID Rank Predicted_Score Recent_Predicted_Score
  2. 0 2021-07-04 33 2 87 86.0
  3. 2 2021-06-13 33 4 88 86.0
  4. 4 2021-03-31 33 7 88 86.0
  5. 6 2021-02-28 33 2 86 81.0
  6. 8 2021-02-14 33 10 86 81.0
  7. 10 2021-01-31 33 8 86 81.0
  8. 12 2020-12-23 33 1 81 80.0
  9. 14 2020-11-08 33 3 80 80.0
  10. 16 2020-10-21 33 3 80 80.0
  11. 18 2020-09-23 33 4 80 80.0
  12. 20 2020-05-20 33 3 80 79.0
  13. 22 2020-04-29 33 4 80 79.0
  14. 24 2020-04-15 33 2 79 79.0
  15. 26 2020-02-26 33 3 79 70.0
  16. 28 2020-02-12 33 5 79 70.0
  17. 30 2020-01-29 33 1 70 NaN
  18. 1 2021-07-04 66 2 87 86.0
  19. 3 2021-06-13 66 4 88 86.0
  20. 5 2021-03-31 66 7 88 86.0
  21. 7 2021-02-28 66 2 86 81.0
  22. 9 2021-02-14 66 10 86 81.0
  23. 11 2021-01-31 66 8 86 81.0
  24. 13 2020-12-23 66 1 81 80.0
  25. 15 2020-11-08 66 3 80 80.0
  26. 17 2020-10-21 66 3 80 80.0
  27. 19 2020-09-23 66 4 80 80.0
  28. 21 2020-05-20 66 3 80 79.0
  29. 23 2020-04-29 66 4 80 79.0
  30. 25 2020-04-15 66 2 79 79.0
  31. 27 2020-02-26 66 3 79 70.0
  32. 29 2020-02-12 66 5 79 70.0
  33. 31 2020-01-29 66 1 70 NaN

答案3

得分: 1

将排名大于3的分数进行屏蔽,然后按学生ID分组,向后填充以传播最后一个预测分数

  1. c = 'Recent_Predicted_Score'
  2. df[c] = df['Predicted_Score'].mask(df['Rank'] > 3)
  3. df[c] = df.groupby('Student_ID')[c].apply(lambda s: s.shift(-1).bfill())

结果

  1. Date Student_ID Rank Predicted_Score Recent_Predicted_Score
  2. 0 4/7/2021 33 2 87 86.0
  3. 1 13/6/2021 33 4 88 86.0
  4. 2 31/3/2021 33 7 88 86.0
  5. 3 28/2/2021 33 2 86 81.0
  6. 4 14/2/2021 33 10 86 81.0
  7. 5 31/1/2021 33 8 86 81.0
  8. 6 23/12/2020 33 1 81 80.0
  9. 7 8/11/2020 33 3 80 80.0
  10. 8 21/10/2020 33 3 80 80.0
  11. 9 23/9/2020 33 4 80 80.0
  12. 10 20/5/2020 33 3 80 79.0
  13. 11 29/4/2020 33 4 80 79.0
  14. 12 15/4/2020 33 2 79 79.0
  15. 13 26/2/2020 33 3 79 70.0
  16. 14 12/2/2020 33 5 79 70.0
  17. 15 29/1/2020 33 1 70 NaN

注意:确保你的数据帧按日期降序排序。

英文:

Mask the scores where rank is greater than 3 then group the masked column by Student_ID and backward fill to propagate the last predicted score

  1. c = &#39;Recent_Predicted_Score&#39;
  2. df[c] = df[&#39;Predicted_Score&#39;].mask(df[&#39;Rank&#39;].gt(3))
  3. df[c] = df.groupby(&#39;Student_ID&#39;)[c].apply(lambda s: s.shift(-1).bfill())

Result

  1. Date Student_ID Rank Predicted_Score Recent_Predicted_Score
  2. 0 4/7/2021 33 2 87 86.0
  3. 1 13/6/2021 33 4 88 86.0
  4. 2 31/3/2021 33 7 88 86.0
  5. 3 28/2/2021 33 2 86 81.0
  6. 4 14/2/2021 33 10 86 81.0
  7. 5 31/1/2021 33 8 86 81.0
  8. 6 23/12/2020 33 1 81 80.0
  9. 7 8/11/2020 33 3 80 80.0
  10. 8 21/10/2020 33 3 80 80.0
  11. 9 23/9/2020 33 4 80 80.0
  12. 10 20/5/2020 33 3 80 79.0
  13. 11 29/4/2020 33 4 80 79.0
  14. 12 15/4/2020 33 2 79 79.0
  15. 13 26/2/2020 33 3 79 70.0
  16. 14 12/2/2020 33 5 79 70.0
  17. 15 29/1/2020 33 1 70 NaN

Note: Make sure your dataframe is sorted on Date in descending order.

huangapple
  • 本文由 发表于 2023年2月18日 22:23:09
  • 转载请务必保留本文链接:https://go.coder-hub.com/75493970.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定