Group by in dplyr and determining if the most recent entry is repeated before

Question


In my data, different people are registered in 4 courses over different time frames.
I am trying to figure out whether the most recently registered person in each course is new to that course.

    library(dplyr)

    df <- data.frame(course=c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4),
                     name=c("jessica","alex","rose","sam","paul","steve","sarah","julia","paul","alex","helen","adam",
                            "jessica","moe","rose","dave"),
                     date=as.Date(c("2023-02-10","2022-10-10","2022-10-10","2022-10-10","2022-11-10","2023-02-10","2022-11-10",
                                    "2022-11-10","2022-12-10","2022-12-10","2022-12-10","2022-12-10","2022-10-10",
                                    "2022-11-10","2023-02-10","2023-02-10")) )

    df %>% arrange(course,date)
    #>    course    name       date
    #> 1       1 jessica 2022-10-10
    #> 2       1    paul 2022-11-10
    #> 3       1    paul 2022-12-10
    #> 4       1 jessica 2023-02-10
    #> 5       2    alex 2022-10-10
    #> 6       2     moe 2022-11-10
    #> 7       2    alex 2022-12-10
    #> 8       2   steve 2023-02-10
    #> 9       3    rose 2022-10-10
    #> 10      3   sarah 2022-11-10
    #> 11      3   helen 2022-12-10
    #> 12      3    rose 2023-02-10
    #> 13      4     sam 2022-10-10
    #> 14      4   julia 2022-11-10
    #> 15      4    adam 2022-12-10
    #> 16      4    dave 2023-02-10

For example, the most recent person registered in course 1 is jessica, but she was also in the course before (so the outcome would be 0 for this course). The most recent person registered in course 2 is steve, who was NOT in the course before (so the outcome would be 1 for this course).

Hence, the expected outcome would look like this:

    course new_registered
         1              0
         2              1
         3              0
         4              1

Thank you so much for the help!

Answer 1

Score: 1


Here is one option to check whether the last registered name was registered in the course in the past:

    library(dplyr, warn.conflicts = FALSE)

    df |>
      arrange(course, date) |>
      group_by(course) |>
      summarise(new_registered = +(!last(name) %in% rev(name)[-1]))
    #> # A tibble: 4 × 2
    #>   course new_registered
    #>    <dbl>          <int>
    #> 1      1              0
    #> 2      2              1
    #> 3      3              0
    #> 4      4              1
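
To make the expression inside `summarise()` easier to follow, here is a minimal base R sketch that evaluates it by hand for course 1. The vector is copied from the sorted data in the question; the helper names `latest` and `earlier` are only illustrative and do not appear in the answer.

```r
# Names registered in course 1, already sorted by date (from the question's data)
name <- c("jessica", "paul", "paul", "jessica")

latest  <- name[length(name)]  # most recent registrant; dplyr::last(name) gives the same value
earlier <- rev(name)[-1]       # everyone except the most recent registrant, newest first

seen_before    <- latest %in% earlier  # TRUE: jessica registered in this course before
new_registered <- +(!seen_before)      # unary plus turns the logical into an integer (0 here)

print(earlier)
print(new_registered)
```

Because `%in%` binds more tightly than `!`, the answer's `!last(name) %in% rev(name)[-1]` is read as `!(last(name) %in% rev(name)[-1])`, which is what the sketch spells out in two steps.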

EDIT: As requested, the edited code additionally checks whether the course has more than one row and returns FALSE (0) if it does not:

    df <- data.frame(
      course = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 5),
      name = c(
        "jessica", "alex", "rose", "sam", "paul", "steve", "sarah", "julia", "paul", "alex", "helen", "adam",
        "jessica", "moe", "rose", "dave", "will"
      ),
      date = as.Date(c(
        "2023-02-10", "2022-10-10", "2022-10-10", "2022-10-10", "2022-11-10", "2023-02-10", "2022-11-10",
        "2022-11-10", "2022-12-10", "2022-12-10", "2022-12-10", "2022-12-10", "2022-10-10",
        "2022-11-10", "2023-02-10", "2023-02-10", "2023-02-10"
      ))
    )

    library(dplyr, warn.conflicts = FALSE)

    df |>
      arrange(course, date) |>
      group_by(course) |>
      summarise(new_registered = +(!last(name) %in% rev(name)[-1] & n() > 1))
    #> # A tibble: 5 × 2
    #>   course new_registered
    #>    <dbl>          <int>
    #> 1      1              0
    #> 2      2              1
    #> 3      3              0
    #> 4      4              1
    #> 5      5              0
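
The reason the extra `n() > 1` condition matters can be seen with a small base R sketch of the single-row case (course 5 in the extended data). Here `length(name)` stands in for `n()`, and the variable names are illustrative only.

```r
# Course 5 has exactly one registration
name <- "will"

earlier <- rev(name)[-1]  # character(0): there are no earlier registrants

# Without the guard, an empty history means "not seen before", so the result is 1
without_guard <- +(!name %in% earlier)

# With the guard, a course that has only one row is forced to 0
with_guard <- +(!name %in% earlier & length(name) > 1)

print(without_guard)
print(with_guard)
```

Whether a lone registrant should count as new is a modelling choice; the edited answer follows the asker's request and reports 0 for such courses.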

Answer 2

Score: 1


Using reframe, check whether the 'name' on the most recent date occurs exactly once within the course:

    library(dplyr) # version >= 1.1.0

    df %>%
      reframe(new_registered = +(sum(name[which.max(date)] == name) == 1),
              .by = course)

Output:

      course new_registered
    1      1              0
    2      2              1
    3      3              0
    4      4              1
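
As a rough illustration of what the `reframe()` expression computes per group, the following base R sketch walks through course 2 using the rows in their original, unsorted order from the question. The intermediate names `i`, `latest` and `occurrences` are assumptions made for readability.

```r
# Course 2 rows in the order they appear in df (no sorting needed for this approach)
name <- c("alex", "steve", "alex", "moe")
date <- as.Date(c("2022-10-10", "2023-02-10", "2022-12-10", "2022-11-10"))

i <- which.max(date)                # position of the most recent date
latest <- name[i]                   # name registered on that date
occurrences <- sum(latest == name)  # how often that name appears in the course

new_registered <- +(occurrences == 1)  # 1 when the latest registrant appears only once

print(latest)
print(new_registered)
```

One caveat to keep in mind: `which.max()` returns the first position of the maximum, so if several people share the most recent date in a course, only the first of them is checked.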

huangapple
  • Posted on April 4, 2023 at 14:16:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75926045.html