How to filter a database in R and create a new one that keeps only the latest-dated row for each individual?
Question
Any help is much appreciated!
I have a large database (more than 1000 records), and I would like to eliminate some rows and keep only the latest information for each individual. I have no idea how to start.
Original Database
Name | Year | Weight |
---|---|---|
John | 2021-04-03 | 203 |
John | 2022-08-02 | 198 |
John | 2018-08-34 | 234 |
Patrick | 2014-05-09 | 176 |
Patrick | 2021-03-09 | 199 |
Patrick | 2020-09-03 | 200 |
Peter | 2019-09-05 | 204 |
Peter | 2017-07-14 | 209 |
Peter | 2019-10-05 | 199 |
Final Database
Name | Year | Weight |
---|---|---|
John | 2022-08-02 | 198 |
Patrick | 2021-03-09 | 199 |
Peter | 2019-10-05 | 199 |
Answer 1
Score: 1
We could use slice_max(), which keeps the row with the largest Year within each Name:
library(dplyr) # version >= 1.1.0 (needed for the by argument)
df1 %>%
  slice_max(Year, by = 'Name')
Output:
Name Year Weight
1 John 2022-08-02 198
2 Patrick 2021-03-09 199
3 Peter 2019-10-05 199
Or, with earlier versions of dplyr (1.0.0 or later, where slice_max() exists but has no by argument), group explicitly:
df1 %>%
  group_by(Name) %>%
  slice_max(Year) %>%
  ungroup()
# A tibble: 3 × 3
Name Year Weight
<chr> <chr> <int>
1 John 2022-08-02 198
2 Patrick 2021-03-09 199
3 Peter 2019-10-05 199
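Note that Year is stored as text here, so slice_max() compares the strings alphabetically. That happens to give the right answer for dates written as YYYY-MM-DD, but it would not for other layouts. A minimal sketch of a more defensive variant follows, assuming it is acceptable to add a helper column (called Date here, a name chosen only for illustration) and to keep exactly one row per name even when the latest date appears twice:

```r
library(dplyr)

# Same sample data as in the question, rebuilt so the block runs on its own
df1 <- data.frame(
  Name = rep(c("John", "Patrick", "Peter"), each = 3),
  Year = c("2021-04-03", "2022-08-02", "2018-08-34",
           "2014-05-09", "2021-03-09", "2020-09-03",
           "2019-09-05", "2017-07-14", "2019-10-05"),
  Weight = c(203L, 198L, 234L, 176L, 199L, 200L, 204L, 209L, 199L),
  stringsAsFactors = FALSE
)

latest <- df1 %>%
  # Parse the text into real dates; impossible dates such as "2018-08-34" become NA
  mutate(Date = as.Date(Year, format = "%Y-%m-%d")) %>%
  group_by(Name) %>%
  # with_ties = FALSE returns a single row per group even if the top date is repeated
  slice_max(Date, n = 1, with_ties = FALSE) %>%
  ungroup()

latest
```

The design choice here is to compare real Date values rather than strings, and to make the "one row per individual" requirement explicit through with_ties = FALSE instead of relying on the data having no repeated dates.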
Or with data.table, using .I to collect the row index of the latest date within each Name:
library(data.table)
setDT(df1)[df1[, .I[which.max(as.Date(Year))], Name]$V1]
Name Year Weight
1: John 2022-08-02 198
2: Patrick 2021-03-09 199
3: Peter 2019-10-05 199
Data:
df1 <- structure(list(Name = c("John", "John", "John", "Patrick", "Patrick",
"Patrick", "Peter", "Peter", "Peter"), Year = c("2021-04-03",
"2022-08-02", "2018-08-34", "2014-05-09", "2021-03-09", "2020-09-03",
"2019-09-05", "2017-07-14", "2019-10-05"), Weight = c(203L, 198L,
234L, 176L, 199L, 200L, 204L, 209L, 199L)),
class = "data.frame", row.names = c(NA,
-9L))
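If installing packages is not an option, the same result can probably be reached in base R. The following is only a sketch under the assumption that the dates are written as YYYY-MM-DD; the object names bad, ordered, and latest are made up for the example. It also lists the rows whose dates cannot be parsed, which in the sample data is the entry 2018-08-34.

```r
# Same sample data as above, rebuilt so the block runs on its own
df1 <- data.frame(
  Name = rep(c("John", "Patrick", "Peter"), each = 3),
  Year = c("2021-04-03", "2022-08-02", "2018-08-34",
           "2014-05-09", "2021-03-09", "2020-09-03",
           "2019-09-05", "2017-07-14", "2019-10-05"),
  Weight = c(203L, 198L, 234L, 176L, 199L, 200L, 204L, 209L, 199L),
  stringsAsFactors = FALSE
)

# Convert the text to Date; values that are not real dates become NA
df1$Date <- as.Date(df1$Year, format = "%Y-%m-%d")

# Rows whose date could not be parsed, worth checking by hand
bad <- df1[is.na(df1$Date), ]

# Sort by name, then by date from newest to oldest (NA dates go last)
ordered <- df1[order(df1$Name, -as.numeric(df1$Date)), ]

# Keep the first (newest) row of each name
latest <- ordered[!duplicated(ordered$Name), c("Name", "Year", "Weight")]
rownames(latest) <- NULL

latest
```

Sorting first and then dropping repeated names with duplicated() keeps the logic to two steps, at the cost of reordering the rows by name.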