Group by in dplyr and determining if the most recent entry is repeated before
Question
In my data, there are different people who are registered in 4 courses in different timeframes.
I am trying to figure out whether the most recently registered person in a course is new to that course or not.
library(dplyr)

df <- data.frame(
  course = c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4),
  name = c("jessica","alex","rose","sam","paul","steve","sarah","julia","paul","alex","helen","adam",
           "jessica","moe","rose","dave"),
  date = as.Date(c("2023-02-10","2022-10-10","2022-10-10","2022-10-10","2022-11-10","2023-02-10","2022-11-10",
                   "2022-11-10","2022-12-10","2022-12-10","2022-12-10","2022-12-10","2022-10-10",
                   "2022-11-10","2023-02-10","2023-02-10"))
)

# Sort by course, then by registration date
df %>% arrange(course, date)
   course    name       date
1       1 jessica 2022-10-10
2       1    paul 2022-11-10
3       1    paul 2022-12-10
4       1 jessica 2023-02-10
5       2    alex 2022-10-10
6       2     moe 2022-11-10
7       2    alex 2022-12-10
8       2   steve 2023-02-10
9       3    rose 2022-10-10
10      3   sarah 2022-11-10
11      3   helen 2022-12-10
12      3    rose 2023-02-10
13      4     sam 2022-10-10
14      4   julia 2022-11-10
15      4    adam 2022-12-10
16      4    dave 2023-02-10
For example, the most recent person registered in course 1 is jessica, but she was also in the course before (so the outcome would be 0 for this course). The most recent person registered in course 2 is steve, who was NOT in the course before (so the outcome would be 1 for this course).
Hence, the desired outcome would look like this:
course new_registered
1 0
2 1
3 0
4 1
Thank you so much for the help!
Answer 1
Score: 1
Here is one option to check whether the last registered name was registered in the course in the past:
library(dplyr, warn.conflicts = FALSE)

df |>
  arrange(course, date) |>
  group_by(course) |>
  # 1 if the latest name does not appear among the earlier rows of the course
  summarise(new_registered = +(!last(name) %in% rev(name)[-1]))
#> # A tibble: 4 × 2
#> course new_registered
#> <dbl> <int>
#> 1 1 0
#> 2 2 1
#> 3 3 0
#> 4 4 1
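To see why the expression inside summarise() works, it can help to evaluate it by hand for a single course. The following is only a minimal base R sketch using the course 1 names from the question, already sorted by date; the variable names `latest` and `earlier` are illustrative and not part of the original answer.

```r
# Names for course 1, already sorted by date (taken from the question's data)
name <- c("jessica", "paul", "paul", "jessica")

# Base R stand-in for dplyr::last(name)
latest <- name[length(name)]

# Reverse the vector, then drop its first element (the latest entry),
# which leaves only the earlier registrations
earlier <- rev(name)[-1]

# TRUE here, because jessica already appears among the earlier rows
latest %in% earlier

# Negate and convert the logical to an integer with unary plus
new_registered <- +(!latest %in% earlier)
new_registered
```

Because `%in%` binds more tightly than `!`, the negation applies to the whole membership test, so no extra parentheses are needed.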
EDIT: As requested, the edited code additionally checks whether the course has more than one row and returns 0 (FALSE) if it does not:
df <- data.frame(
  course = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 5),
  name = c(
    "jessica", "alex", "rose", "sam", "paul", "steve", "sarah", "julia", "paul", "alex", "helen", "adam",
    "jessica", "moe", "rose", "dave", "will"
  ),
  date = as.Date(c(
    "2023-02-10", "2022-10-10", "2022-10-10", "2022-10-10", "2022-11-10", "2023-02-10", "2022-11-10",
    "2022-11-10", "2022-12-10", "2022-12-10", "2022-12-10", "2022-12-10", "2022-10-10",
    "2022-11-10", "2023-02-10", "2023-02-10", "2023-02-10"
  ))
)

library(dplyr, warn.conflicts = FALSE)

df |>
  arrange(course, date) |>
  group_by(course) |>
  summarise(new_registered = +(!last(name) %in% rev(name)[-1] & n() > 1))
#> # A tibble: 5 × 2
#> course new_registered
#> <dbl> <int>
#> 1 1 0
#> 2 2 1
#> 3 3 0
#> 4 4 1
#> 5 5 0
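The reason the extra n() > 1 condition matters for course 5 can be sketched in base R. This is an illustration only: `length(name)` stands in for dplyr's n(), and the variable names are assumptions made for the example.

```r
# Course 5 has a single registration
name <- "will"

# Dropping the first element of a length-one vector leaves an empty vector
earlier <- rev(name)[-1]

# Nothing is ever found in an empty vector, so the negated test is TRUE
last_is_new <- !name[length(name)] %in% earlier

# Without the guard, the single registrant counts as new (1)
without_guard <- +(last_is_new)

# With the guard, a course with only one row yields 0
with_guard <- +(last_is_new & length(name) > 1)
```

Whether a course with a single registration should count as 0 or 1 depends on the use case; the edited answer chooses 0.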
Answer 2
Score: 1
Using reframe, check whether the name on the most recent date appears exactly once in the course:
library(dplyr) # version >= 1.1.0

df %>%
  reframe(new_registered = +(sum(name[which.max(date)] == name) == 1),
          .by = course)
Output:
course new_registered
1 1 0
2 2 1
3 3 0
4 4 1
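The counting logic of this approach can be traced for a single course in base R. The sketch below uses the course 2 rows from the question; the variable names are illustrative. Note that this approach does not require sorting first, because which.max() locates the latest date directly. If several rows share the latest date, which.max() returns only the first of them.

```r
# Course 2 rows from the question's data
name <- c("alex", "moe", "alex", "steve")
date <- as.Date(c("2022-10-10", "2022-11-10", "2022-12-10", "2023-02-10"))

# Name registered on the most recent date
latest <- name[which.max(date)]

# How many times that name occurs in the course overall
n_occurrences <- sum(latest == name)

# 1 if the latest name occurs exactly once (a new person), otherwise 0
new_registered <- +(n_occurrences == 1)
new_registered
```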