July 11, 2023 03:34:10

How do I get the last record in a groupby() in pandas?

Question

I have a dataframe df which has a number of records for each student. Frequently I want to get the one with the last timestamp.

What is the best way to do this? Previously I had been using last(), but this gives the last non-null value in each column, when really I just want the last row's value, null or otherwise.

Using apply(lambda r: r.iloc[-1]) works, but the code feels ugly (I hate using an apply and anecdotally it feels slow and inefficient, likely because of the apply).

What is the right way to do this?

(Pdb) df = pd.DataFrame([["A",2,3],["B",5,6],["A",np.nan,4]], columns=["student", "value_a", "timestamp"]).sort_values("timestamp")
(Pdb) df
  student  value_a  timestamp
0       A      2.0          3
2       A      NaN          4
1       B      5.0          6
(Pdb) df.groupby("student").last()
# This gives the wrong answer
         value_a  timestamp
student                    
A            2.0          4
B            5.0          6
(Pdb) df.groupby("student").apply(lambda r: r.iloc[-1])
# This gives the right answer but feels inefficient
        student  value_a  timestamp
student                            
A             A      NaN          4
B             B      5.0          6

Answer 1

Score: 5

One option is to use groupby.tail:

df.groupby('student').tail(1)

Output:

  student  value_a  timestamp
2       A      NaN          4
1       B      5.0          6

Note that if you want the row with the last (largest) timestamp regardless of row order, another option is to index with groupby.idxmax:

df.loc[df.groupby('student')['timestamp'].idxmax()]
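
To make the difference between the two options concrete, here is a minimal sketch that assumes the example data from the question, deliberately left unsorted. The names by_tail and by_idxmax are only illustrative. The tail(1) route depends on row order, so it sorts first; the idxmax route looks up index labels and so assumes the index labels are unique.

```python
import numpy as np
import pandas as pd

# Same data as in the question, deliberately left unsorted.
df = pd.DataFrame(
    [["A", 2, 3], ["B", 5, 6], ["A", np.nan, 4]],
    columns=["student", "value_a", "timestamp"],
)

# tail(1) keeps the last row of each group in the frame's current row order,
# so the frame is sorted by timestamp first.
by_tail = df.sort_values("timestamp").groupby("student").tail(1)

# idxmax returns the index label of the largest timestamp in each group,
# so it does not depend on row order (assumption: index labels are unique).
by_idxmax = df.loc[df.groupby("student")["timestamp"].idxmax()]

print(by_tail)
print(by_idxmax)
```

Both frames should contain the same two rows, including the NaN in value_a for student A.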

Answer 2

Score: 4

You can try .nth:

out = df.groupby('student').nth(-1)
print(out)

Prints:

         value_a  timestamp
student                    
A            NaN          4
B            5.0          6
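
One caveat worth checking on your own installation: the output above, with student as the index, matches older pandas releases. To my knowledge, from pandas 2.0 onward nth behaves as a filter and keeps the original row index and the student column, much like tail(1). The sketch below assumes the question's example data and compares only plain values, so it should hold under either behaviour; last_rows is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["A", 2, 3], ["B", 5, 6], ["A", np.nan, 4]],
    columns=["student", "value_a", "timestamp"],
).sort_values("timestamp")

# nth(-1) takes the last row of each group and, unlike last(), keeps NaN.
last_rows = df.groupby("student").nth(-1)

# The index layout differs between pandas versions, so inspect values only.
print(last_rows["timestamp"].tolist())
print(last_rows["value_a"].isna().tolist())
```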

Answer 3

Score: 4

If you choose to use nth or tail, you have to sort your dataframe first. Once it is sorted, you can also simply drop the duplicates and keep the last row per student:

>>> df.sort_values('timestamp').drop_duplicates('student', keep='last')
  student  value_a  timestamp
2       A      NaN          4
1       B      5.0          6
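
If this pattern comes up often, it could be wrapped in a small helper. The following is only a sketch: last_record and its key and order_by parameters are hypothetical names, not part of pandas. It uses a stable sort so that rows with tied timestamps keep their original relative order, which makes the choice of "last" row deterministic.

```python
import numpy as np
import pandas as pd


def last_record(frame, key="student", order_by="timestamp"):
    """Return one row per `key`: the row with the largest `order_by` value."""
    # A stable sort keeps the original relative order of tied values.
    ordered = frame.sort_values(order_by, kind="stable")
    # keep="last" retains the final row of each key after sorting.
    return ordered.drop_duplicates(key, keep="last")


df = pd.DataFrame(
    [["A", 2, 3], ["B", 5, 6], ["A", np.nan, 4]],
    columns=["student", "value_a", "timestamp"],
)
result = last_record(df)
print(result)
```

Because no groupby is involved, the result keeps the original index labels and all columns, including NaN values.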
