How do I get the last record in a groupby() in pandas?
Question
I have a dataframe df which has a number of records for each student. Frequently I want to get the one with the last timestamp.
What is the best way to do this? Previously I had been using last() but this gives the last non null value when really I just want the last value, null or otherwise.
Using apply(lambda r: r.iloc[-1]) works, but the code feels ugly (I hate using an apply and anecdotally it feels slow and inefficient, likely because of the apply).
What is the right way to do this?
(Pdb) df = pd.DataFrame([["A", 2, 3], ["B", 5, 6], ["A", np.nan, 4]], columns=["student", "value_a", "timestamp"]).sort_values("timestamp")
(Pdb) df
  student  value_a  timestamp
0       A      2.0          3
2       A      NaN          4
1       B      5.0          6
(Pdb) df.groupby("student").last()
# This gives the wrong answer
         value_a  timestamp
student
A            2.0          4
B            5.0          6
(Pdb) df.groupby("student").apply(lambda r: r.iloc[-1])
# This gives the right answer but feels inefficient
        student  value_a  timestamp
student
A             A      NaN          4
B             B      5.0          6
Answer 1
Score: 5
One option is to use groupby.tail:
df.groupby('student').tail(1)
Output:
  student  value_a  timestamp
2       A      NaN          4
1       B      5.0          6
Note that if you want the last timestamp, another option is to index with groupby.idxmax:
df.loc[df.groupby('student')['timestamp'].idxmax()]
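The difference between the two options only shows up when the frame is not already sorted by timestamp. Here is a minimal sketch of that case, assuming the same columns as in the question; the name df_unsorted and the row order are made up for illustration. It also assumes the index labels are unique, since .loc with a repeated label would return more than one row.

```python
import numpy as np
import pandas as pd

# Same data as the question, but deliberately NOT sorted by timestamp
# (df_unsorted is a hypothetical name used only for this sketch).
df_unsorted = pd.DataFrame(
    [["A", np.nan, 4], ["B", 5, 6], ["A", 2, 3]],
    columns=["student", "value_a", "timestamp"],
)

# tail(1) keeps the last row of each group in the frame's current row order,
# so for student A it picks the row with timestamp 3.
by_position = df_unsorted.groupby("student").tail(1)

# idxmax() returns, per group, the index label of the largest timestamp,
# so for student A it picks the row with timestamp 4 regardless of row order.
by_timestamp = df_unsorted.loc[
    df_unsorted.groupby("student")["timestamp"].idxmax()
]

print(by_position)
print(by_timestamp)
```

So tail(1) answers "last row as stored", while idxmax answers "row with the latest timestamp". If several rows in a group share the maximum timestamp, idxmax returns the first of them.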
Answer 2
Score: 4
You can try .nth:
out = df.groupby('student').nth(-1)
print(out)
Prints:
         value_a  timestamp
student
A            NaN          4
B            5.0          6
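One caveat, as far as I know: the printed shape above is what older pandas produces. Since pandas 2.0, nth behaves like a filter and returns the selected rows with their original index and with student kept as a column. The sketch below rebuilds the question's frame and normalizes the result so it looks the same either way; the normalization step is my own addition, not part of the answer.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question, sorted by timestamp.
df = pd.DataFrame(
    [["A", 2, 3], ["B", 5, 6], ["A", np.nan, 4]],
    columns=["student", "value_a", "timestamp"],
).sort_values("timestamp")

out = df.groupby("student").nth(-1)

# Older pandas moves "student" into the index; newer pandas keeps it as a
# column. Bring both cases to the same layout: "student" as a sorted index.
if "student" not in out.columns:
    out = out.reset_index()
out = out.set_index("student").sort_index()

print(out)
```

Either way, nth(-1) keeps the NaN in value_a for student A, which is the behavior the question asks for.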
Answer 3
Score: 4
You have to sort your dataframe first if you choose to use nth or tail. After that you can drop the dupes:
>>> df.sort_values('timestamp').drop_duplicates('student', keep='last')
  student  value_a  timestamp
2       A      NaN          4
1       B      5.0          6
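If you need this often, the sort-then-drop pattern can be wrapped in a small helper. This is only a sketch: last_row_per_group and its parameter names are hypothetical, and the stable mergesort is my own choice so that rows with equal timestamps keep their original relative order.

```python
import numpy as np
import pandas as pd


def last_row_per_group(frame, key, order_by):
    """Return one row per `key`: the row that comes last when sorted by `order_by`.

    Hypothetical helper for illustration; not part of pandas.
    """
    # A stable sort keeps the original relative order of tied rows.
    ordered = frame.sort_values(order_by, kind="mergesort")
    # After sorting, the last duplicate of each key is the latest record.
    return ordered.drop_duplicates(key, keep="last")


# The question's data, left unsorted on purpose.
records = pd.DataFrame(
    [["A", 2, 3], ["B", 5, 6], ["A", np.nan, 4]],
    columns=["student", "value_a", "timestamp"],
)

result = last_row_per_group(records, key="student", order_by="timestamp")
print(result)
```

Because drop_duplicates does not look at value_a at all, the NaN for student A is kept.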