基于 Pandas 数据框中上次行的数值创建新列等于某些数字。

huangapple go评论66阅读模式
英文:

New column based on last time row value equals some numbers in Pandas dataframe

问题

我已经将您提供的内容翻译成中文,以下是代码部分的中文翻译:

# 创建一个名为Recent_Predicted_Score的列,记录学生在前3名中的最近预测分数
df.sort_values(by=['Student_ID', 'Date'], ascending=[True, False], inplace=True)
lp1 = df['Predicted_Score'].where(df['Rank'].isin([1, 2, 3])).groupby(df['Student_ID']).bfill()
lp2 = df.groupby(['Student_ID', 'Rank'])['Predicted_Score'].shift(-1)
df = df.assign(Recent_Predicted_Score=lp1.mask(df['Rank'].isin([1, 2, 3]), lp2))

请注意,上述代码是用于创建"Recent_Predicted_Score"列的Python代码,用于记录学生在前三名中的最近预测分数。如果您有任何其他问题或需要进一步的帮助,请随时告诉我。

英文:

I have a dataframe sorted in descending order date that records the Rank of students in class and the predicted score.

Date	    Student_ID	Rank	Predicted_Score
4/7/2021	33			2		87
13/6/2021	33			4		88
31/3/2021	33			7		88
28/2/2021	33			2		86
14/2/2021	33			10		86
31/1/2021	33			8		86
23/12/2020	33			1		81
8/11/2020	33			3		80
21/10/2020	33			3		80
23/9/2020	33			4		80
20/5/2020	33			3		80
29/4/2020	33			4		80
15/4/2020	33			2		79
26/2/2020	33			3		79
12/2/2020	33			5		79
29/1/2020	33			1		70

I want to create a column called Recent_Predicted_Score that record the last predicted_score where that student actually ranks top 3. So the desired outcome looks like

Date	    Student_ID	Rank	Predicted_Score	   Recent_Predicted_Score
4/7/2021	33			2		87				   86
13/6/2021	33			4		88				   86
31/3/2021	33			7		88				   86
28/2/2021	33			2		86				   81
14/2/2021	33			10		86				   81
31/1/2021	33			8		86				   81
23/12/2020	33			1		81				   80
8/11/2020	33			3		80				   80
21/10/2020	33			3		80				   80
23/9/2020	33			4		80				   80
20/5/2020	33			3		80				   79
29/4/2020	33			4		80				   79
15/4/2020	33			2		79				   79
26/2/2020	33			3		79				   70
12/2/2020	33			5		79				   70
29/1/2020	33			1		70

Here's what I have tried but it doesn't quite work, not sure if I am on the right track:

df.sort_values(by = ['Student_ID', 'Date'], ascending = [True, False], inplace = True)
lp1 = df['Predicted_Score'].where(df['Rank'].isin([1,2,3])).groupby(df['Student_ID']).bfill()
lp2 = df.groupby(['Student_ID', 'Rank'])['Predicted_Score'].shift(-1)
df = df.assign(Recent_Predicted_Score=lp1.mask(df['Rank'].isin([1,2,3]), lp2))

Thanks in advance.

答案1

得分: 2

尝试:

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

df = df.sort_values(['Student_ID', 'Date'])

df['Recent_Predicted_Score'] = np.where(df['Rank'].isin([1, 2, 3]), df['Predicted_Score'], np.nan)
df['Recent_Predicted_Score'] = df.groupby('Student_ID', group_keys=False)['Recent_Predicted_Score'].apply(lambda x: x.ffill().shift().fillna(''))

df = df.sort_values(['Student_ID', 'Date'], ascending=[True, False])
print(df)

打印结果:

         Date  Student_ID  Rank  Predicted_Score Recent_Predicted_Score
0  2021-07-04          33     2               87                   86.0
1  2021-06-13          33     4               88                   86.0
2  2021-03-31          33     7               88                   86.0
3  2021-02-28          33     2               86                   81.0
4  2021-02-14          33    10               86                   81.0
5  2021-01-31          33     8               86                   81.0
6  2020-12-23          33     1               81                   80.0
7  2020-11-08          33     3               80                   80.0
8  2020-10-21          33     3               80                   80.0
9  2020-09-23          33     4               80                   80.0
10 2020-05-20          33     3               80                   79.0
11 2020-04-29          33     4               80                   79.0
12 2020-04-15          33     2               79                   79.0
13 2020-02-26          33     3               79                   70.0
14 2020-02-12          33     5               79                   70.0
15 2020-01-29          33     1               70                       
英文:

Try:

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

df = df.sort_values(['Student_ID', 'Date'])

df['Recent_Predicted_Score'] = np.where(df['Rank'].isin([1, 2, 3]), df['Predicted_Score'], np.nan)
df['Recent_Predicted_Score'] = df.groupby('Student_ID', group_keys=False)['Recent_Predicted_Score'].apply(lambda x: x.ffill().shift().fillna(''))

df = df.sort_values(['Student_ID', 'Date'], ascending = [True, False])
print(df)

Prints:

         Date  Student_ID  Rank  Predicted_Score Recent_Predicted_Score
0  2021-07-04          33     2               87                   86.0
1  2021-06-13          33     4               88                   86.0
2  2021-03-31          33     7               88                   86.0
3  2021-02-28          33     2               86                   81.0
4  2021-02-14          33    10               86                   81.0
5  2021-01-31          33     8               86                   81.0
6  2020-12-23          33     1               81                   80.0
7  2020-11-08          33     3               80                   80.0
8  2020-10-21          33     3               80                   80.0
9  2020-09-23          33     4               80                   80.0
10 2020-05-20          33     3               80                   79.0
11 2020-04-29          33     4               80                   79.0
12 2020-04-15          33     2               79                   79.0
13 2020-02-26          33     3               79                   70.0
14 2020-02-12          33     5               79                   70.0
15 2020-01-29          33     1               70                       

答案2

得分: 1

让我们假设:

  • 可能存在多个唯一的 Student_ID
  • 行按照 OP 指定的降序 Date 排序,但不一定按 Student_ID 排序
  • 我们希望保留原始数据框的索引

在满足这些假设的情况下,以下是实现您问题要求的方法:

df['Recent_Predicted_Score'] = df.loc[df['Rank'] <= 3, 'Predicted_Score']
df['Recent_Predicted_Score'] = ( df
    .groupby('Student_ID', sort=False)
    .apply(lambda group: group['Predicted_Score'].shift(-1).bfill())
    ['Recent_Predicted_Score'] )

解释:

  • 创建一个新列 Recent_Predicted_Score,其中包含 Rank 在前三名的行的 Predicted_Score,其他行为 NaN
  • 使用 groupby() 根据 Student_ID 分组,将 sort 参数设置为 False 以提高性能(注意,groupby() 保留每个组内行的顺序,具体而言,不影响现有的按 Date 降序排序的顺序)
  • 在每个组内,使用 shift(-1)bfill() 来获得 Recent_Predicted_Score 的期望结果。

以下是示例输入(包含两个不同的 Student_ID 值)和输出,已根据 Student_ID, Date 进行排序以便更轻松检查:

(示例输入和输出数据已省略)

希望这对您有所帮助。

英文:

Let's assume:

  • there may be more than one unique Student_ID
  • the rows are ordered by descending Date as indicated by OP, but may not be ordered by Student_ID
  • we want to preserve the index of the original dataframe

Subject to these assumptions, here's a way to do what your question asks:

df[&#39;Recent_Predicted_Score&#39;] = df.loc[df.Rank &lt;= 3, &#39;Predicted_Score&#39;]
df[&#39;Recent_Predicted_Score&#39;] = ( df
    .groupby(&#39;Student_ID&#39;, sort=False)
    .apply(lambda group: group.shift(-1).bfill())
    [&#39;Recent_Predicted_Score&#39;] )

Explanation:

  • create a new column Recent_Predicted_Score containing the PredictedScore where Rank is in the top 3 and NaN otherwise
  • use groupby() on Student_ID with the sort argument set to False for better performance (note that groupby() preserves the order of rows within each group, specifically, not influencing the existing descending order by Date)
  • within each group, do shift(-1) and bfill() to get the desired result for Recent_Predicted_Score.

Sample input (with two distinct Student_ID values):

         Date  Student_ID  Rank  Predicted_Score
0  2021-07-04          33     2               87
1  2021-07-04          66     2               87
2  2021-06-13          33     4               88
3  2021-06-13          66     4               88
4  2021-03-31          33     7               88
5  2021-03-31          66     7               88
6  2021-02-28          33     2               86
7  2021-02-28          66     2               86
8  2021-02-14          33    10               86
9  2021-02-14          66    10               86
10 2021-01-31          33     8               86
11 2021-01-31          66     8               86
12 2020-12-23          33     1               81
13 2020-12-23          66     1               81
14 2020-11-08          33     3               80
15 2020-11-08          66     3               80
16 2020-10-21          33     3               80
17 2020-10-21          66     3               80
18 2020-09-23          33     4               80
19 2020-09-23          66     4               80
20 2020-05-20          33     3               80
21 2020-05-20          66     3               80
22 2020-04-29          33     4               80
23 2020-04-29          66     4               80
24 2020-04-15          33     2               79
25 2020-04-15          66     2               79
26 2020-02-26          33     3               79
27 2020-02-26          66     3               79
28 2020-02-12          33     5               79
29 2020-02-12          66     5               79
30 2020-01-29          33     1               70
31 2020-01-29          66     1               70

Output:

         Date  Student_ID  Rank  Predicted_Score  Recent_Predicted_Score
0  2021-07-04          33     2               87                    86.0
1  2021-07-04          66     2               87                    86.0
2  2021-06-13          33     4               88                    86.0
3  2021-06-13          66     4               88                    86.0
4  2021-03-31          33     7               88                    86.0
5  2021-03-31          66     7               88                    86.0
6  2021-02-28          33     2               86                    81.0
7  2021-02-28          66     2               86                    81.0
8  2021-02-14          33    10               86                    81.0
9  2021-02-14          66    10               86                    81.0
10 2021-01-31          33     8               86                    81.0
11 2021-01-31          66     8               86                    81.0
12 2020-12-23          33     1               81                    80.0
13 2020-12-23          66     1               81                    80.0
14 2020-11-08          33     3               80                    80.0
15 2020-11-08          66     3               80                    80.0
16 2020-10-21          33     3               80                    80.0
17 2020-10-21          66     3               80                    80.0
18 2020-09-23          33     4               80                    80.0
19 2020-09-23          66     4               80                    80.0
20 2020-05-20          33     3               80                    79.0
21 2020-05-20          66     3               80                    79.0
22 2020-04-29          33     4               80                    79.0
23 2020-04-29          66     4               80                    79.0
24 2020-04-15          33     2               79                    79.0
25 2020-04-15          66     2               79                    79.0
26 2020-02-26          33     3               79                    70.0
27 2020-02-26          66     3               79                    70.0
28 2020-02-12          33     5               79                    70.0
29 2020-02-12          66     5               79                    70.0
30 2020-01-29          33     1               70                     NaN
31 2020-01-29          66     1               70                     NaN

Output sorted by Student_ID, Date for easier inspection:

         Date  Student_ID  Rank  Predicted_Score  Recent_Predicted_Score
0  2021-07-04          33     2               87                    86.0
2  2021-06-13          33     4               88                    86.0
4  2021-03-31          33     7               88                    86.0
6  2021-02-28          33     2               86                    81.0
8  2021-02-14          33    10               86                    81.0
10 2021-01-31          33     8               86                    81.0
12 2020-12-23          33     1               81                    80.0
14 2020-11-08          33     3               80                    80.0
16 2020-10-21          33     3               80                    80.0
18 2020-09-23          33     4               80                    80.0
20 2020-05-20          33     3               80                    79.0
22 2020-04-29          33     4               80                    79.0
24 2020-04-15          33     2               79                    79.0
26 2020-02-26          33     3               79                    70.0
28 2020-02-12          33     5               79                    70.0
30 2020-01-29          33     1               70                     NaN
1  2021-07-04          66     2               87                    86.0
3  2021-06-13          66     4               88                    86.0
5  2021-03-31          66     7               88                    86.0
7  2021-02-28          66     2               86                    81.0
9  2021-02-14          66    10               86                    81.0
11 2021-01-31          66     8               86                    81.0
13 2020-12-23          66     1               81                    80.0
15 2020-11-08          66     3               80                    80.0
17 2020-10-21          66     3               80                    80.0
19 2020-09-23          66     4               80                    80.0
21 2020-05-20          66     3               80                    79.0
23 2020-04-29          66     4               80                    79.0
25 2020-04-15          66     2               79                    79.0
27 2020-02-26          66     3               79                    70.0
29 2020-02-12          66     5               79                    70.0
31 2020-01-29          66     1               70                     NaN

答案3

得分: 1

将排名大于3的分数进行屏蔽,然后按学生ID分组,向后填充以传播最后一个预测分数

c = 'Recent_Predicted_Score'
df[c] = df['Predicted_Score'].mask(df['Rank'] > 3)
df[c] = df.groupby('Student_ID')[c].apply(lambda s: s.shift(-1).bfill())

结果

              Date  Student_ID  Rank  Predicted_Score  Recent_Predicted_Score
    0     4/7/2021          33     2               87                    86.0
    1    13/6/2021          33     4               88                    86.0
    2    31/3/2021          33     7               88                    86.0
    3    28/2/2021          33     2               86                    81.0
    4    14/2/2021          33    10               86                    81.0
    5    31/1/2021          33     8               86                    81.0
    6   23/12/2020          33     1               81                    80.0
    7    8/11/2020          33     3               80                    80.0
    8   21/10/2020          33     3               80                    80.0
    9    23/9/2020          33     4               80                    80.0
    10   20/5/2020          33     3               80                    79.0
    11   29/4/2020          33     4               80                    79.0
    12   15/4/2020          33     2               79                    79.0
    13   26/2/2020          33     3               79                    70.0
    14   12/2/2020          33     5               79                    70.0
    15   29/1/2020          33     1               70                     NaN

注意:确保你的数据帧按日期降序排序。

英文:

Mask the scores where rank is greater than 3 then group the masked column by Student_ID and backward fill to propagate the last predicted score

c = &#39;Recent_Predicted_Score&#39;
df[c] = df[&#39;Predicted_Score&#39;].mask(df[&#39;Rank&#39;].gt(3))
df[c] = df.groupby(&#39;Student_ID&#39;)[c].apply(lambda s: s.shift(-1).bfill())

Result

          Date  Student_ID  Rank  Predicted_Score  Recent_Predicted_Score
0     4/7/2021          33     2               87                    86.0
1    13/6/2021          33     4               88                    86.0
2    31/3/2021          33     7               88                    86.0
3    28/2/2021          33     2               86                    81.0
4    14/2/2021          33    10               86                    81.0
5    31/1/2021          33     8               86                    81.0
6   23/12/2020          33     1               81                    80.0
7    8/11/2020          33     3               80                    80.0
8   21/10/2020          33     3               80                    80.0
9    23/9/2020          33     4               80                    80.0
10   20/5/2020          33     3               80                    79.0
11   29/4/2020          33     4               80                    79.0
12   15/4/2020          33     2               79                    79.0
13   26/2/2020          33     3               79                    70.0
14   12/2/2020          33     5               79                    70.0
15   29/1/2020          33     1               70                     NaN

Note: Make sure your dataframe is sorted on Date in descending order.

huangapple
  • 本文由 发表于 2023年2月18日 22:23:09
  • 转载请务必保留本文链接:https://go.coder-hub.com/75493970.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定