检索符合条件的最后一行数据。

huangapple go评论73阅读模式
英文:

Retrieve last row of data with conditions

问题

我有以下的大型数据集,记录了数学竞赛结果,按日期降序排列:例如,学生1在比赛1中获得第三名,而学生3在比赛2中获得第一名,依此类推。

Race_ID   Date           Student_ID      Rank  
21         1/1/2023       1               3     
21         1/1/2023       2               2     
21         1/1/2023       3               1     
21         1/1/2023       4               4     
25         11/9/2022      1               2     
25         11/9/2022      2               3     
25         11/9/2022      3               1     
3          17/4/2022      5               4     
3          17/4/2022      2               1     
3          17/4/2022      3               2     
3          17/4/2022      4               3     
14         1/3/2022       1               1     
14         1/3/2022       2               2     
85         1/1/2021       1               2     
85         1/1/2021       2               3     
85         1/1/2021       3               1     

我想创建一个名为```Last_win```的新列,它返回学生上次获胜的```Race_ID```(即排名第一的比赛)。所以结果应该如下:

Race_ID   Date           Student_ID      Rank    Last_win
21         1/1/2023       1               3       14
21         1/1/2023       2               2       3
21         1/1/2023       3               1       25
21         1/1/2023       4               4       NaN
25         11/9/2022      1               2       14
25         11/9/2022      2               3       3
25         11/9/2022      3               1       85
3          17/4/2022      5               4       NaN
3          17/4/2022      2               1       NaN
3          17/4/2022      3               2       85
3          17/4/2022      4               3       NaN
14         1/3/2022       1               1       NaN
14         1/3/2022       2               2       NaN
85         1/1/2021       1               2       NaN
85         1/1/2021       2               3       NaN
85         1/1/2021       3               1       NaN

非常感谢您的提前帮助。
英文:

I have the following large dataset recording the result of a math competition among students in descending order of date: So for example, student 1 comes third in Race 1 while student 3 won Race 2, etc.

Race_ID   Date           Student_ID      Rank  
21         1/1/2023       1               3     
21         1/1/2023       2               2     
21         1/1/2023       3               1     
21         1/1/2023       4               4     
25         11/9/2022      1               2     
25         11/9/2022      2               3     
25         11/9/2022      3               1     
3          17/4/2022      5               4     
3          17/4/2022      2               1     
3          17/4/2022      3               2     
3          17/4/2022      4               3     
14         1/3/2022       1               1     
14         1/3/2022       2               2     
85         1/1/2021       1               2     
85         1/1/2021       2               3     
85         1/1/2021       3               1     

And I want to create a new column called Last_win which returns the Race_ID of the last time that student won (i.e. rank number 1). So the outcome should look like

Race_ID   Date           Student_ID      Rank    Last_win
21         1/1/2023       1               3       14
21         1/1/2023       2               2       3
21         1/1/2023       3               1       25
21         1/1/2023       4               4       NaN
25         11/9/2022      1               2       14
25         11/9/2022      2               3       3
25         11/9/2022      3               1       85
3          17/4/2022      5               4       NaN
3          17/4/2022      2               1       NaN
3          17/4/2022      3               2       85
3          17/4/2022      4               3       NaN
14         1/3/2022       1               1       NaN
14         1/3/2022       2               2       NaN
85         1/1/2021       1               2       NaN
85         1/1/2021       2               3       NaN
85         1/1/2021       3               1       NaN

Thank you so much inadvacne.

答案1

得分: 2

保留Rank为1的行,隐藏其他行,然后按学生分组。最后,向后填充Race_ID,不要忘记移动结果:

# 如果需要,将Date列转换为DatetimeIndex
# df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

df['Last_win'] = (df.where(df['Rank'] == 1)
                    .groupby(df['Student_ID'])['Race_ID']
                    .transform(lambda x: x.bfill().shift(-1)))
print(df)

# 输出
    Race_ID       Date  Student_ID  Rank  Last_win
0        21 2023-01-01           1     3      14.0
1        21 2023-01-01           2     2       3.0
2        21 2023-01-01           3     1      25.0
3        21 2023-01-01           4     4       NaN
4        25 2022-09-11           1     2      14.0
5        25 2022-09-11           2     3       3.0
6        25 2022-09-11           3     1      85.0
7         3 2022-04-17           5     4       NaN
8         3 2022-04-17           2     1       NaN
9         3 2022-04-17           3     2      85.0
10        3 2022-04-17           4     3       NaN
11       14 2022-03-01           1     1       NaN
12       14 2022-03-01           2     2       NaN
13       85 2021-01-01           1     2       NaN
14       85 2021-01-01           2     3       NaN
15       85 2021-01-01           3     1       NaN
英文:

Keep rows where Rank is 1 and hide others then group by Student. Finally, fill backward Race_ID and don't forget to shift the result:

# Convert Date column to DatetimeIndex if needed
# df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['Last_win'] = (df.where(df['Rank'] == 1)
.groupby(df['Student_ID'])['Race_ID']
.transform(lambda x: x.bfill().shift(-1)))
print(df)
# Output
Race_ID       Date  Student_ID  Rank  Last_win
0        21 2023-01-01           1     3      14.0
1        21 2023-01-01           2     2       3.0
2        21 2023-01-01           3     1      25.0
3        21 2023-01-01           4     4       NaN
4        25 2022-09-11           1     2      14.0
5        25 2022-09-11           2     3       3.0
6        25 2022-09-11           3     1      85.0
7         3 2022-04-17           5     4       NaN
8         3 2022-04-17           2     1       NaN
9         3 2022-04-17           3     2      85.0
10        3 2022-04-17           4     3       NaN
11       14 2022-03-01           1     1       NaN
12       14 2022-03-01           2     2       NaN
13       85 2021-01-01           1     2       NaN
14       85 2021-01-01           2     3       NaN
15       85 2021-01-01           3     1       NaN

答案2

得分: 1

使用 merge_asof 来合并先前日期,通过参数 allow_exact_matches=False

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

df = df.sort_values('Date')

df1 = (df.loc[df['Rank'].eq(1), ['Date','Student_ID','Race_ID']]
         .rename(columns={'Race_ID':'Last_win'}))

df = (pd.merge_asof(df.reset_index(), 
                    df1, on='Date', allow_exact_matches=False, by='Student_ID')
        .sort_values('index', ignore_index=True)
        .drop('index', axis=1))

print(df)

输出结果如下:

    Race_ID       Date  Student_ID  Rank  Last_win
0        21 2023-01-01           1     3      14.0
1        21 2023-01-01           2     2       3.0
2        21 2023-01-01           3     1      25.0
3        21 2023-01-01           4     4       NaN
4        25 2022-09-11           1     2      14.0
5        25 2022-09-11           2     3       3.0
6        25 2022-09-11           3     1      85.0
7         3 2022-04-17           5     4       NaN
8         3 2022-04-17           2     1       NaN
9         3 2022-04-17           3     2      85.0
10        3 2022-04-17           4     3       NaN
11       14 2022-03-01           1     1       NaN
12       14 2022-03-01           2     2       NaN
13       85 2021-01-01           1     2       NaN
14       85 2021-01-01           2     3       NaN
15       85 2021-01-01           3     1       NaN
英文:

Use merge_asof for merge previous datetimes by parameter allow_exact_matches=False:

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values('Date')
df1 = (df.loc[df['Rank'].eq(1), ['Date','Student_ID','Race_ID']]
.rename(columns={'Race_ID':'Last_win'}))
df = (pd.merge_asof(df.reset_index(), 
df1, on='Date', allow_exact_matches=False, by='Student_ID')
.sort_values('index', ignore_index=True)
.drop('index', axis=1))
print (df)
Race_ID       Date  Student_ID  Rank  Last_win
0        21 2023-01-01           1     3      14.0
1        21 2023-01-01           2     2       3.0
2        21 2023-01-01           3     1      25.0
3        21 2023-01-01           4     4       NaN
4        25 2022-09-11           1     2      14.0
5        25 2022-09-11           2     3       3.0
6        25 2022-09-11           3     1      85.0
7         3 2022-04-17           5     4       NaN
8         3 2022-04-17           2     1       NaN
9         3 2022-04-17           3     2      85.0
10        3 2022-04-17           4     3       NaN
11       14 2022-03-01           1     1       NaN
12       14 2022-03-01           2     2       NaN
13       85 2021-01-01           1     2       NaN
14       85 2021-01-01           2     3       NaN
15       85 2021-01-01           3     1       NaN

huangapple
  • 本文由 发表于 2023年2月8日 18:02:31
  • 转载请务必保留本文链接:https://go.coder-hub.com/75384112.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定