Group by in dplyr and determine whether the most recent entry appeared before

Question

In my data, different people are registered in 4 courses at different times.
I am trying to figure out whether the most recently registered person in each course is new to that course.

df <- data.frame(course=c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4),
                 name=c("jessica","alex","rose","sam","paul","steve","sarah","julia","paul","alex","helen","adam",
                     "jessica","moe","rose","dave"),
                 date=as.Date(c("2023-02-10","2022-10-10","2022-10-10","2022-10-10","2022-11-10","2023-02-10","2022-11-10",
                                "2022-11-10","2022-12-10","2022-12-10","2022-12-10","2022-12-10","2022-10-10",
                                "2022-11-10","2023-02-10","2023-02-10")) )  
library(dplyr)

df %>% arrange(course, date)
   course    name       date
1       1 jessica 2022-10-10
2       1    paul 2022-11-10
3       1    paul 2022-12-10
4       1 jessica 2023-02-10
5       2    alex 2022-10-10
6       2     moe 2022-11-10
7       2    alex 2022-12-10
8       2   steve 2023-02-10
9       3    rose 2022-10-10
10      3   sarah 2022-11-10
11      3   helen 2022-12-10
12      3    rose 2023-02-10
13      4     sam 2022-10-10
14      4   julia 2022-11-10
15      4    adam 2022-12-10
16      4    dave 2023-02-10

For example, the most recent person registered in course 1 is jessica, but she was also in the course before (so the outcome would be 0 for this course). The most recent person registered in course 2 is steve, who was not in the course before (so the outcome would be 1 for this course).

Hence, the desired output would look like this:

course   new_registered
1             0
2             1
3             0
4             1

Thank you so much for the help!

Answer 1

Score: 1

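Here is one option: sort by date within each course, then check whether the last registered name also appears among the earlier entries of that course. rev(name)[-1] drops the most recent entry, and the unary + converts the logical result to an integer: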

library(dplyr, warn.conflicts = FALSE)

df |> 
  arrange(course, date) |> 
  group_by(course) |> 
  summarise(new_registered = +(!last(name) %in% rev(name)[-1]))
#> # A tibble: 4 × 2
#>   course new_registered
#>    <dbl>          <int>
#> 1      1              0
#> 2      2              1
#> 3      3              0
#> 4      4              1
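
To make the expression inside summarise() easier to follow, here is a minimal base R sketch of the same test on plain vectors. The vectors are the course 1 and course 2 names from the question, already sorted by date; the variable names latest and earlier are only illustrative.

```r
# Registrations for course 1, already sorted by date (from the question's data).
name <- c("jessica", "paul", "paul", "jessica")

latest  <- name[length(name)]   # most recent registrant: "jessica"
earlier <- rev(name)[-1]        # all entries except the most recent, newest first

latest %in% earlier             # TRUE: jessica registered before
+(!latest %in% earlier)         # 0: not a new registrant

# Course 2 ends with a name that has no earlier entry.
name2 <- c("alex", "moe", "alex", "steve")
+(!name2[length(name2)] %in% rev(name2)[-1])   # 1: steve is new
```

Because %in% binds more tightly than !, the expression reads as "not (latest is among the earlier names)".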

EDIT: As requested, the edited code additionally checks whether a course has more than one row and returns 0 if it does not. The example data now includes a course 5 with a single registration:

df <- data.frame(
  course = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 5),
  name = c(
    "jessica", "alex", "rose", "sam", "paul", "steve", "sarah", "julia", "paul", "alex", "helen", "adam",
    "jessica", "moe", "rose", "dave", "will"
  ),
  date = as.Date(c(
    "2023-02-10", "2022-10-10", "2022-10-10", "2022-10-10", "2022-11-10", "2023-02-10", "2022-11-10",
    "2022-11-10", "2022-12-10", "2022-12-10", "2022-12-10", "2022-12-10", "2022-10-10",
    "2022-11-10", "2023-02-10", "2023-02-10", "2023-02-10"
  ))
)

library(dplyr, warn.conflicts = FALSE)

df |>
  arrange(course, date) |>
  group_by(course) |>
  summarise(new_registered = +(!last(name) %in% rev(name)[-1] & n() > 1))
#> # A tibble: 5 × 2
#>   course new_registered
#>    <dbl>          <int>
#> 1      1              0
#> 2      2              1
#> 3      3              0
#> 4      4              1
#> 5      5              0
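
The reason the extra n() > 1 condition changes the result for course 5 can be seen on a single-element vector. This is a small base R sketch, with length(name) standing in for n() outside of dplyr:

```r
# A course with a single registration (course 5 in the edited data).
name <- "will"

earlier <- rev(name)[-1]   # character(0): there are no earlier entries
length(earlier)            # 0
name %in% earlier          # FALSE, because nothing matches an empty vector

# Without the row-count guard, the lone registrant is flagged as new.
+(!name %in% earlier)                      # 1
# With the guard (length(name) plays the role of n() here), the result is 0.
+(!name %in% earlier & length(name) > 1)   # 0
```

Whether a lone registrant should count as new is a judgment call; the guard encodes the choice made in the edit above.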

Answer 2

Score: 1

Using reframe(), check whether the name on the most recent date occurs exactly once within its course:

library(dplyr) # version >= 1.1.0
df %>%
  reframe(new_registered = +(sum(name[which.max(date)] == name) == 1),
          .by = course)

Output:

   course new_registered
1      1              0
2      2              1
3      3              0
4      4              1
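
For readers without dplyr 1.1.0, the same counting idea can be sketched in base R with split() and vapply(). This is an illustrative equivalent under the question's data, not part of the original answer:

```r
df <- data.frame(
  course = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
  name = c("jessica", "alex", "rose", "sam", "paul", "steve", "sarah", "julia",
           "paul", "alex", "helen", "adam", "jessica", "moe", "rose", "dave"),
  date = as.Date(c("2023-02-10", "2022-10-10", "2022-10-10", "2022-10-10",
                   "2022-11-10", "2023-02-10", "2022-11-10", "2022-11-10",
                   "2022-12-10", "2022-12-10", "2022-12-10", "2022-12-10",
                   "2022-10-10", "2022-11-10", "2023-02-10", "2023-02-10"))
)

# For each course, take the name on the latest date and test
# whether that name occurs exactly once in the course.
new_registered <- vapply(
  split(df, df$course),
  function(d) as.integer(sum(d$name == d$name[which.max(d$date)]) == 1),
  integer(1)
)

data.frame(course = as.numeric(names(new_registered)),
           new_registered = unname(new_registered))
```

Note that which.max() returns the index of the first maximum, so if two people registered in a course on the same latest date, only the first of them (in row order) is checked. No sorting is needed with this approach.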

huangapple
  • Published on April 4, 2023 at 14:16:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75926045.html