How do I get the last record in a groupby() in pandas?

Question

I have a dataframe df which has a number of records for each student. Frequently I want to get the one with the last timestamp.

What is the best way to do this? Previously I had been using last(), but this gives the last non-null value in each column, when really I just want the last row, null or otherwise.

Using apply(lambda r: r.iloc[-1]) works, but the code feels ugly (I hate using an apply and anecdotally it feels slow and inefficient, likely because of the apply).

What is the right way to do this?

(Pdb) df = pd.DataFrame([["A",2,3],["B",5,6],["A",np.nan,4]], columns=["student", "value_a", "timestamp"]).sort_values("timestamp")
(Pdb) df
  student  value_a  timestamp
0       A      2.0          3
2       A      NaN          4
1       B      5.0          6

(Pdb) df.groupby("student").last()
# This gives the wrong answer
         value_a  timestamp
student                    
A            2.0          4
B            5.0          6

(Pdb) df.groupby("student").apply(lambda r: r.iloc[-1])
# This gives the right answer but feels inefficient
        student  value_a  timestamp
student                            
A             A      NaN          4
B             B      5.0          6

Answer 1

Score: 5

One option is to use groupby.tail:

df.groupby('student').tail(1)

Output:

  student  value_a  timestamp
2       A      NaN          4
1       B      5.0          6

Note that if you want the last timestamp, another option is to index with groupby.idxmax:

df.loc[df.groupby('student')['timestamp'].idxmax()]
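
A minimal sketch contrasting the two options, assuming the rows arrive unsorted (the `unsorted` frame below is made up for illustration). `tail(1)` returns the last row of each group in row order, while the `idxmax` lookup picks the row with the largest timestamp regardless of order. The lookup relies on the index labels being unique.

```python
import numpy as np
import pandas as pd

# Hypothetical data: same records as the question, but NOT sorted by timestamp.
unsorted = pd.DataFrame(
    [["A", np.nan, 4], ["B", 5, 6], ["A", 2, 3]],
    columns=["student", "value_a", "timestamp"],
)

# tail(1) keeps the last row of each group in the current row order.
by_position = unsorted.groupby("student").tail(1)

# idxmax returns the index label of the largest timestamp per group,
# so the result does not depend on row order (index labels must be unique).
by_timestamp = unsorted.loc[unsorted.groupby("student")["timestamp"].idxmax()]

print(by_position)
print(by_timestamp)
```

On this unsorted input the two results differ for student A, which is why `tail(1)` only answers the question when the frame is already sorted by timestamp.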

Answer 2

Score: 4

You can try .nth:

out = df.groupby('student').nth(-1)
print(out)

Prints:

         value_a  timestamp
student                    
A            NaN          4
B            5.0          6
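
A small check of `nth(-1)` against the question's data. One caveat, stated as my understanding rather than something from the answer: the output above, indexed by student, matches older pandas releases, while from pandas 2.0 onward `nth` appears to act as a filter that keeps the original row index and the `student` column. The sketch below only inspects values, which should be the same either way.

```python
import numpy as np
import pandas as pd

# The question's example frame, sorted by timestamp.
df = pd.DataFrame(
    [["A", 2, 3], ["B", 5, 6], ["A", np.nan, 4]],
    columns=["student", "value_a", "timestamp"],
).sort_values("timestamp")

# nth(-1) takes the last row of each group by position and, unlike last(),
# does not skip NaN values.
out = df.groupby("student").nth(-1)

# The index and columns of the result differ across pandas versions,
# so only the values are inspected here.
print(out["timestamp"].tolist())
print(out["value_a"].isna().sum())
```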

Answer 3

Score: 4

If you choose nth or tail, you have to sort your dataframe by timestamp first. Once it is sorted, you can also just drop the duplicates and keep the last row per student:

>>> df.sort_values('timestamp').drop_duplicates('student', keep='last')
  student  value_a  timestamp
2       A      NaN          4
1       B      5.0          6
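
The same idea can be wrapped in a small helper. This is only a sketch: `last_per_group` and the `records` frame are hypothetical names for illustration, not part of pandas, and the stable sort is my own addition so that rows with tied timestamps keep their original relative order.

```python
import numpy as np
import pandas as pd


def last_per_group(frame, key, order_by):
    """Return the row with the largest `order_by` value for each `key`.

    Hypothetical helper for illustration; not a pandas API.
    """
    # kind="stable" keeps the original relative order of tied rows.
    ordered = frame.sort_values(order_by, kind="stable")
    # keep="last" retains the final row seen for each key, NaN cells included.
    return ordered.drop_duplicates(key, keep="last")


# Same records as the question, deliberately left unsorted.
records = pd.DataFrame(
    [["A", np.nan, 4], ["B", 5, 6], ["A", 2, 3]],
    columns=["student", "value_a", "timestamp"],
)

result = last_per_group(records, "student", "timestamp")
print(result)
```

Because the sort happens inside the helper, the caller does not need to remember to sort first, and no `apply` is involved.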

huangapple
  • Published on July 11, 2023 at 03:34:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76656788.html