2023年2月18日 22:23:09go评论96阅读模式

英文:

New column based on last time row value equals some numbers in Pandas dataframe

问题

我已经将您提供的内容翻译成中文，以下是代码部分的中文翻译：

# 创建一个名为Recent_Predicted_Score的列，记录学生在前3名中的最近预测分数
df.sort_values(by=['Student_ID', 'Date'], ascending=[True, False], inplace=True)
lp1 = df['Predicted_Score'].where(df['Rank'].isin([1, 2, 3])).groupby(df['Student_ID']).bfill()
lp2 = df.groupby(['Student_ID', 'Rank'])['Predicted_Score'].shift(-1)
df = df.assign(Recent_Predicted_Score=lp1.mask(df['Rank'].isin([1, 2, 3]), lp2))

请注意，上述代码是用于创建"Recent_Predicted_Score"列的Python代码，用于记录学生在前三名中的最近预测分数。如果您有任何其他问题或需要进一步的帮助，请随时告诉我。

英文:

I have a dataframe sorted in descending order date that records the Rank of students in class and the predicted score.

Date	    Student_ID	Rank	Predicted_Score
4/7/2021	33			2		87
13/6/2021	33			4		88
31/3/2021	33			7		88
28/2/2021	33			2		86
14/2/2021	33			10		86
31/1/2021	33			8		86
23/12/2020	33			1		81
8/11/2020	33			3		80
21/10/2020	33			3		80
23/9/2020	33			4		80
20/5/2020	33			3		80
29/4/2020	33			4		80
15/4/2020	33			2		79
26/2/2020	33			3		79
12/2/2020	33			5		79
29/1/2020	33			1		70

I want to create a column called Recent_Predicted_Score that record the last predicted_score where that student actually ranks top 3. So the desired outcome looks like

Date	    Student_ID	Rank	Predicted_Score	   Recent_Predicted_Score
4/7/2021	33			2		87				   86
13/6/2021	33			4		88				   86
31/3/2021	33			7		88				   86
28/2/2021	33			2		86				   81
14/2/2021	33			10		86				   81
31/1/2021	33			8		86				   81
23/12/2020	33			1		81				   80
8/11/2020	33			3		80				   80
21/10/2020	33			3		80				   80
23/9/2020	33			4		80				   80
20/5/2020	33			3		80				   79
29/4/2020	33			4		80				   79
15/4/2020	33			2		79				   79
26/2/2020	33			3		79				   70
12/2/2020	33			5		79				   70
29/1/2020	33			1		70

Here's what I have tried but it doesn't quite work, not sure if I am on the right track:

df.sort_values(by = [&#39;Student_ID&#39;, &#39;Date&#39;], ascending = [True, False], inplace = True)
lp1 = df[&#39;Predicted_Score&#39;].where(df[&#39;Rank&#39;].isin([1,2,3])).groupby(df[&#39;Student_ID&#39;]).bfill()
lp2 = df.groupby([&#39;Student_ID&#39;, &#39;Rank&#39;])[&#39;Predicted_Score&#39;].shift(-1)
df = df.assign(Recent_Predicted_Score=lp1.mask(df[&#39;Rank&#39;].isin([1,2,3]), lp2))

Thanks in advance.

答案1

得分: 2

尝试：

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values(['Student_ID', 'Date'])
df['Recent_Predicted_Score'] = np.where(df['Rank'].isin([1, 2, 3]), df['Predicted_Score'], np.nan)
df['Recent_Predicted_Score'] = df.groupby('Student_ID', group_keys=False)['Recent_Predicted_Score'].apply(lambda x: x.ffill().shift().fillna(''))
df = df.sort_values(['Student_ID', 'Date'], ascending=[True, False])
print(df)

打印结果：

         Date  Student_ID  Rank  Predicted_Score Recent_Predicted_Score
0  2021-07-04          33     2               87                   86.0
1  2021-06-13          33     4               88                   86.0
2  2021-03-31          33     7               88                   86.0
3  2021-02-28          33     2               86                   81.0
4  2021-02-14          33    10               86                   81.0
5  2021-01-31          33     8               86                   81.0
6  2020-12-23          33     1               81                   80.0
7  2020-11-08          33     3               80                   80.0
8  2020-10-21          33     3               80                   80.0
9  2020-09-23          33     4               80                   80.0
10 2020-05-20          33     3               80                   79.0
11 2020-04-29          33     4               80                   79.0
12 2020-04-15          33     2               79                   79.0
13 2020-02-26          33     3               79                   70.0
14 2020-02-12          33     5               79                   70.0
15 2020-01-29          33     1               70

英文:

Try:

df[&#39;Date&#39;] = pd.to_datetime(df[&#39;Date&#39;], dayfirst=True)
df = df.sort_values([&#39;Student_ID&#39;, &#39;Date&#39;])
df[&#39;Recent_Predicted_Score&#39;] = np.where(df[&#39;Rank&#39;].isin([1, 2, 3]), df[&#39;Predicted_Score&#39;], np.nan)
df[&#39;Recent_Predicted_Score&#39;] = df.groupby(&#39;Student_ID&#39;, group_keys=False)[&#39;Recent_Predicted_Score&#39;].apply(lambda x: x.ffill().shift().fillna(&#39;&#39;))
df = df.sort_values([&#39;Student_ID&#39;, &#39;Date&#39;], ascending = [True, False])
print(df)

Prints:

         Date  Student_ID  Rank  Predicted_Score Recent_Predicted_Score
0  2021-07-04          33     2               87                   86.0
1  2021-06-13          33     4               88                   86.0
2  2021-03-31          33     7               88                   86.0
3  2021-02-28          33     2               86                   81.0
4  2021-02-14          33    10               86                   81.0
5  2021-01-31          33     8               86                   81.0
6  2020-12-23          33     1               81                   80.0
7  2020-11-08          33     3               80                   80.0
8  2020-10-21          33     3               80                   80.0
9  2020-09-23          33     4               80                   80.0
10 2020-05-20          33     3               80                   79.0
11 2020-04-29          33     4               80                   79.0
12 2020-04-15          33     2               79                   79.0
13 2020-02-26          33     3               79                   70.0
14 2020-02-12          33     5               79                   70.0
15 2020-01-29          33     1               70

答案2

得分: 1

让我们假设：

可能存在多个唯一的 Student_ID
行按照 OP 指定的降序 Date 排序，但不一定按 Student_ID 排序
我们希望保留原始数据框的索引

在满足这些假设的情况下，以下是实现您问题要求的方法：

df['Recent_Predicted_Score'] = df.loc[df['Rank'] <= 3, 'Predicted_Score']
df['Recent_Predicted_Score'] = ( df
    .groupby('Student_ID', sort=False)
    .apply(lambda group: group['Predicted_Score'].shift(-1).bfill())
    ['Recent_Predicted_Score'] )

解释：

创建一个新列 Recent_Predicted_Score，其中包含 Rank 在前三名的行的 Predicted_Score，其他行为 NaN
使用 groupby() 根据 Student_ID 分组，将 sort 参数设置为 False 以提高性能（注意，groupby() 保留每个组内行的顺序，具体而言，不影响现有的按 Date 降序排序的顺序）
在每个组内，使用 shift(-1) 和 bfill() 来获得 Recent_Predicted_Score 的期望结果。

以下是示例输入（包含两个不同的 Student_ID 值）和输出，已根据 Student_ID, Date 进行排序以便更轻松检查：

（示例输入和输出数据已省略）

希望这对您有所帮助。

英文:

Let's assume:

there may be more than one unique Student_ID
the rows are ordered by descending Date as indicated by OP, but may not be ordered by Student_ID
we want to preserve the index of the original dataframe

Subject to these assumptions, here's a way to do what your question asks:

df[&#39;Recent_Predicted_Score&#39;] = df.loc[df.Rank &lt;= 3, &#39;Predicted_Score&#39;]
df[&#39;Recent_Predicted_Score&#39;] = ( df
    .groupby(&#39;Student_ID&#39;, sort=False)
    .apply(lambda group: group.shift(-1).bfill())
    [&#39;Recent_Predicted_Score&#39;] )

Explanation:

create a new column Recent_Predicted_Score containing the PredictedScore where Rank is in the top 3 and NaN otherwise
use groupby() on Student_ID with the sort argument set to False for better performance (note that groupby() preserves the order of rows within each group, specifically, not influencing the existing descending order by Date)
within each group, do shift(-1) and bfill() to get the desired result for Recent_Predicted_Score.

Sample input (with two distinct Student_ID values):

         Date  Student_ID  Rank  Predicted_Score
0  2021-07-04          33     2               87
1  2021-07-04          66     2               87
2  2021-06-13          33     4               88
3  2021-06-13          66     4               88
4  2021-03-31          33     7               88
5  2021-03-31          66     7               88
6  2021-02-28          33     2               86
7  2021-02-28          66     2               86
8  2021-02-14          33    10               86
9  2021-02-14          66    10               86
10 2021-01-31          33     8               86
11 2021-01-31          66     8               86
12 2020-12-23          33     1               81
13 2020-12-23          66     1               81
14 2020-11-08          33     3               80
15 2020-11-08          66     3               80
16 2020-10-21          33     3               80
17 2020-10-21          66     3               80
18 2020-09-23          33     4               80
19 2020-09-23          66     4               80
20 2020-05-20          33     3               80
21 2020-05-20          66     3               80
22 2020-04-29          33     4               80
23 2020-04-29          66     4               80
24 2020-04-15          33     2               79
25 2020-04-15          66     2               79
26 2020-02-26          33     3               79
27 2020-02-26          66     3               79
28 2020-02-12          33     5               79
29 2020-02-12          66     5               79
30 2020-01-29          33     1               70
31 2020-01-29          66     1               70

Output:

         Date  Student_ID  Rank  Predicted_Score  Recent_Predicted_Score
0  2021-07-04          33     2               87                    86.0
1  2021-07-04          66     2               87                    86.0
2  2021-06-13          33     4               88                    86.0
3  2021-06-13          66     4               88                    86.0
4  2021-03-31          33     7               88                    86.0
5  2021-03-31          66     7               88                    86.0
6  2021-02-28          33     2               86                    81.0
7  2021-02-28          66     2               86                    81.0
8  2021-02-14          33    10               86                    81.0
9  2021-02-14          66    10               86                    81.0
10 2021-01-31          33     8               86                    81.0
11 2021-01-31          66     8               86                    81.0
12 2020-12-23          33     1               81                    80.0
13 2020-12-23          66     1               81                    80.0
14 2020-11-08          33     3               80                    80.0
15 2020-11-08          66     3               80                    80.0
16 2020-10-21          33     3               80                    80.0
17 2020-10-21          66     3               80                    80.0
18 2020-09-23          33     4               80                    80.0
19 2020-09-23          66     4               80                    80.0
20 2020-05-20          33     3               80                    79.0
21 2020-05-20          66     3               80                    79.0
22 2020-04-29          33     4               80                    79.0
23 2020-04-29          66     4               80                    79.0
24 2020-04-15          33     2               79                    79.0
25 2020-04-15          66     2               79                    79.0
26 2020-02-26          33     3               79                    70.0
27 2020-02-26          66     3               79                    70.0
28 2020-02-12          33     5               79                    70.0
29 2020-02-12          66     5               79                    70.0
30 2020-01-29          33     1               70                     NaN
31 2020-01-29          66     1               70                     NaN

Output sorted by Student_ID, Date for easier inspection:

         Date  Student_ID  Rank  Predicted_Score  Recent_Predicted_Score
0  2021-07-04          33     2               87                    86.0
2  2021-06-13          33     4               88                    86.0
4  2021-03-31          33     7               88                    86.0
6  2021-02-28          33     2               86                    81.0
8  2021-02-14          33    10               86                    81.0
10 2021-01-31          33     8               86                    81.0
12 2020-12-23          33     1               81                    80.0
14 2020-11-08          33     3               80                    80.0
16 2020-10-21          33     3               80                    80.0
18 2020-09-23          33     4               80                    80.0
20 2020-05-20          33     3               80                    79.0
22 2020-04-29          33     4               80                    79.0
24 2020-04-15          33     2               79                    79.0
26 2020-02-26          33     3               79                    70.0
28 2020-02-12          33     5               79                    70.0
30 2020-01-29          33     1               70                     NaN
1  2021-07-04          66     2               87                    86.0
3  2021-06-13          66     4               88                    86.0
5  2021-03-31          66     7               88                    86.0
7  2021-02-28          66     2               86                    81.0
9  2021-02-14          66    10               86                    81.0
11 2021-01-31          66     8               86                    81.0
13 2020-12-23          66     1               81                    80.0
15 2020-11-08          66     3               80                    80.0
17 2020-10-21          66     3               80                    80.0
19 2020-09-23          66     4               80                    80.0
21 2020-05-20          66     3               80                    79.0
23 2020-04-29          66     4               80                    79.0
25 2020-04-15          66     2               79                    79.0
27 2020-02-26          66     3               79                    70.0
29 2020-02-12          66     5               79                    70.0
31 2020-01-29          66     1               70                     NaN

答案3

得分: 1

将排名大于3的分数进行屏蔽，然后按学生ID分组，向后填充以传播最后一个预测分数

c = 'Recent_Predicted_Score'
df[c] = df['Predicted_Score'].mask(df['Rank'] > 3)
df[c] = df.groupby('Student_ID')[c].apply(lambda s: s.shift(-1).bfill())

结果

              Date  Student_ID  Rank  Predicted_Score  Recent_Predicted_Score
    0     4/7/2021          33     2               87                    86.0
    1    13/6/2021          33     4               88                    86.0
    2    31/3/2021          33     7               88                    86.0
    3    28/2/2021          33     2               86                    81.0
    4    14/2/2021          33    10               86                    81.0
    5    31/1/2021          33     8               86                    81.0
    6   23/12/2020          33     1               81                    80.0
    7    8/11/2020          33     3               80                    80.0
    8   21/10/2020          33     3               80                    80.0
    9    23/9/2020          33     4               80                    80.0
    10   20/5/2020          33     3               80                    79.0
    11   29/4/2020          33     4               80                    79.0
    12   15/4/2020          33     2               79                    79.0
    13   26/2/2020          33     3               79                    70.0
    14   12/2/2020          33     5               79                    70.0
    15   29/1/2020          33     1               70                     NaN

注意：确保你的数据帧按日期降序排序。

英文:

Mask the scores where rank is greater than 3 then group the masked column by Student_ID and backward fill to propagate the last predicted score

c = &#39;Recent_Predicted_Score&#39;
df[c] = df[&#39;Predicted_Score&#39;].mask(df[&#39;Rank&#39;].gt(3))
df[c] = df.groupby(&#39;Student_ID&#39;)[c].apply(lambda s: s.shift(-1).bfill())

Result

          Date  Student_ID  Rank  Predicted_Score  Recent_Predicted_Score
0     4/7/2021          33     2               87                    86.0
1    13/6/2021          33     4               88                    86.0
2    31/3/2021          33     7               88                    86.0
3    28/2/2021          33     2               86                    81.0
4    14/2/2021          33    10               86                    81.0
5    31/1/2021          33     8               86                    81.0
6   23/12/2020          33     1               81                    80.0
7    8/11/2020          33     3               80                    80.0
8   21/10/2020          33     3               80                    80.0
9    23/9/2020          33     4               80                    80.0
10   20/5/2020          33     3               80                    79.0
11   29/4/2020          33     4               80                    79.0
12   15/4/2020          33     2               79                    79.0
13   26/2/2020          33     3               79                    70.0
14   12/2/2020          33     5               79                    70.0
15   29/1/2020          33     1               70                     NaN

Note: Make sure your dataframe is sorted on Date in descending order.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

基于 Pandas 数据框中上次行的数值创建新列等于某些数字。

问题

答案1

答案2

答案3

该函数未返回任何值。

在Python中计算开始时间

如何在Azure-ML中使用Blob存储账户训练YOLO-NAS模型。

好的，以下是翻译好的内容：在VScode中查看矩阵和更高维数组的好方法。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。