How to filter a data frame in R and create a new one that keeps only the latest-dated row for each individual?

Question

Any help is much appreciated!

I have a large data set (more than 1,000 rows), and I would like to remove some rows so that only the most recent record for each individual's name is kept. I have no idea how to start.

Original Database

Name Year Weight
John 2021-04-03 203
John 2022-08-02 198
John 2018-08-34 234
Patrick 2014-05-09 176
Patrick 2021-03-09 199
Patrick 2020-09-03 200
Peter 2019-09-05 204
Peter 2017-07-14 209
Peter 2019-10-05 199

Final Database

Name Year Weight
John 2022-08-02 198
Patrick 2021-03-09 199
Peter 2019-10-05 199

Answer 1

Score: 1

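We could use slice_max() to keep, for each Name, the row with the largest Year. Year is stored as character here, but because the values are written as YYYY-MM-DD, comparing them as strings gives the same order as comparing them as dates.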

library(dplyr) # version >= 1.1.0
df1 %>%
    slice_max(Year, by = 'Name')

-output

     Name       Year Weight
1    John 2022-08-02    198
2 Patrick 2021-03-09    199
3   Peter 2019-10-05    199

Or, with versions of dplyr before 1.1.0 (which lack the by argument), group explicitly

df1 %>%
   group_by(Name) %>%
   slice_max(Year) %>%
   ungroup()

-output

# A tibble: 3 × 3
  Name    Year       Weight
  <chr>   <chr>       <int>
1 John    2022-08-02    198
2 Patrick 2021-03-09    199
3 Peter   2019-10-05    199

Or in data.table, where .I gives the row index of the latest date within each Name

library(data.table)
setDT(df1)[df1[, .I[which.max(as.Date(Year))], Name]$V1]

-output

      Name       Year Weight
1:    John 2022-08-02    198
2: Patrick 2021-03-09    199
3:   Peter 2019-10-05    199

data

df1 <- structure(list(Name = c("John", "John", "John", "Patrick", "Patrick", 
"Patrick", "Peter", "Peter", "Peter"), Year = c("2021-04-03", 
"2022-08-02", "2018-08-34", "2014-05-09", "2021-03-09", "2020-09-03", 
"2019-09-05", "2017-07-14", "2019-10-05"), Weight = c(203L, 198L, 
234L, 176L, 199L, 200L, 204L, 209L, 199L)), 
class = "data.frame", row.names = c(NA, 
-9L))
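
As a rough complement to the approaches above, here is a base R sketch that needs no extra packages. It is not part of the original answer. It assumes the goal is to compare real dates, so it parses Year with as.Date() first; the value "2018-08-34" in the sample data is not a valid calendar date and becomes NA. Dropping such rows is an assumption made for illustration, and the helper names Date, valid, ordered and latest are made up for this example.

```r
# Sample data copied from the question; Year is stored as character.
df1 <- data.frame(
  Name = c("John", "John", "John", "Patrick", "Patrick",
           "Patrick", "Peter", "Peter", "Peter"),
  Year = c("2021-04-03", "2022-08-02", "2018-08-34", "2014-05-09",
           "2021-03-09", "2020-09-03", "2019-09-05", "2017-07-14",
           "2019-10-05"),
  Weight = c(203L, 198L, 234L, 176L, 199L, 200L, 204L, 209L, 199L)
)

# Parse to Date; "2018-08-34" is not a real calendar date and becomes NA.
df1$Date <- as.Date(df1$Year, format = "%Y-%m-%d")

# Drop rows whose date could not be parsed
# (an assumption; you may prefer to correct them instead).
valid <- df1[!is.na(df1$Date), ]

# Sort by Name, then newest date first, and keep the first row per Name.
ordered <- valid[order(valid$Name, -as.numeric(valid$Date)), ]
latest <- ordered[!duplicated(ordered$Name), c("Name", "Year", "Weight")]
print(latest)
```

Checking sum(is.na(df1$Date)) before filtering is a cheap way to see how many dates failed to parse in the full data set.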

huangapple
  • Posted on March 4, 2023, 01:03:23
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75629946.html