April 4, 2023 14:16:27

Group by in dplyr and determining if the most recent entry is repeated before

Question

In my data, different people are registered in 4 courses over different timeframes.
I am trying to figure out whether the most recently registered person in each course is new to that course.

library(dplyr)

df <- data.frame(course=c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4),
                 name=c("jessica","alex","rose","sam","paul","steve","sarah","julia","paul","alex","helen","adam",
                     "jessica","moe","rose","dave"),
                 date=as.Date(c("2023-02-10","2022-10-10","2022-10-10","2022-10-10","2022-11-10","2023-02-10","2022-11-10",
                                "2022-11-10","2022-12-10","2022-12-10","2022-12-10","2022-12-10","2022-10-10",
                                "2022-11-10","2023-02-10","2023-02-10")) )

df %>% arrange(course,date)
   course    name       date
1       1 jessica 2022-10-10
2       1    paul 2022-11-10
3       1    paul 2022-12-10
4       1 jessica 2023-02-10
5       2    alex 2022-10-10
6       2     moe 2022-11-10
7       2    alex 2022-12-10
8       2   steve 2023-02-10
9       3    rose 2022-10-10
10      3   sarah 2022-11-10
11      3   helen 2022-12-10
12      3    rose 2023-02-10
13      4     sam 2022-10-10
14      4   julia 2022-11-10
15      4    adam 2022-12-10
16      4    dave 2023-02-10

For example, the most recent person registered in course 1 is jessica, but she was also in the course before (so the outcome is 0 for this course). The most recent person registered in course 2 is steve, who was NOT in the course before (so the outcome is 1 for this course).

Hence, the expected outcome would look like:

course   new_registered
1             0
2             1
3             0
4             1

Thank you so much for the help!

Answer 1

Score: 1

Here is one option to check whether the last registered name was registered in the course in the past:

library(dplyr, warn.conflicts = FALSE)
df |> 
  arrange(course, date) |> 
  group_by(course) |> 
  summarise(new_registered = +(!last(name) %in% rev(name)[-1]))
#> # A tibble: 4 × 2
#>   course new_registered
#>    <dbl>          <int>
#> 1      1              0
#> 2      2              1
#> 3      3              0
#> 4      4              1
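
To see what the expression inside summarise() is doing, here is a small base R walk-through for a single course. It is only an illustrative sketch: the vector holds the course 1 names from the question, already sorted by date, and the variable names latest, earlier and is_new are made up for this example.

```r
# Names registered in course 1, already sorted by date (from the question's data)
name <- c("jessica", "paul", "paul", "jessica")

latest  <- name[length(name)]  # base R stand-in for dplyr::last(name)
earlier <- rev(name)[-1]       # reverse the vector, then drop its first element (the latest entry)

is_new <- !(latest %in% earlier)  # TRUE only if the latest name never appeared before
new_registered <- +is_new         # unary plus turns the logical into an integer
new_registered
#> [1] 0
```

Since jessica shows up among the earlier registrations, the result for this course is 0.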

EDIT: As requested, the edited code additionally checks whether a course has more than one row and returns 0 if it does not:

df <- data.frame(
  course = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 5),
  name = c(
    "jessica", "alex", "rose", "sam", "paul", "steve", "sarah", "julia", "paul", "alex", "helen", "adam",
    "jessica", "moe", "rose", "dave", "will"
  ),
  date = as.Date(c(
    "2023-02-10", "2022-10-10", "2022-10-10", "2022-10-10", "2022-11-10", "2023-02-10", "2022-11-10",
    "2022-11-10", "2022-12-10", "2022-12-10", "2022-12-10", "2022-12-10", "2022-10-10",
    "2022-11-10", "2023-02-10", "2023-02-10", "2023-02-10"
  ))
)
library(dplyr, warn.conflicts = FALSE)
df |>
  arrange(course, date) |>
  group_by(course) |>
  summarise(new_registered = +(!last(name) %in% rev(name)[-1] & n() > 1))
#> # A tibble: 5 × 2
#>   course new_registered
#>    <dbl>          <int>
#> 1      1              0
#> 2      2              1
#> 3      3              0
#> 4      4              1
#> 5      5              0
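
If dplyr is not available, the same per-course check could be sketched in base R along the following lines. This is not part of the original answer, just one possible equivalent; the helper name is_new is hypothetical, and the data are copied from the edited example.

```r
# Data copied from the edited example (course 5 has a single registration)
df <- data.frame(
  course = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 5),
  name = c(
    "jessica", "alex", "rose", "sam", "paul", "steve", "sarah", "julia",
    "paul", "alex", "helen", "adam", "jessica", "moe", "rose", "dave", "will"
  ),
  date = as.Date(c(
    "2023-02-10", "2022-10-10", "2022-10-10", "2022-10-10", "2022-11-10",
    "2023-02-10", "2022-11-10", "2022-11-10", "2022-12-10", "2022-12-10",
    "2022-12-10", "2022-12-10", "2022-10-10", "2022-11-10", "2023-02-10",
    "2023-02-10", "2023-02-10"
  ))
)

# Sort by course, then by date, so the last row of each course is the latest one
df <- df[order(df$course, df$date), ]

# Hypothetical helper: 1 if the latest name is new to the course, 0 otherwise
# (courses with a single registration also give 0, as in the edited answer)
is_new <- function(nm) {
  n <- length(nm)
  as.integer(n > 1 && !(nm[n] %in% nm[-n]))
}

# Apply the helper to the names of each course
res <- vapply(split(df$name, df$course), is_new, integer(1))
result <- data.frame(course = as.numeric(names(res)), new_registered = unname(res))
result
#>   course new_registered
#> 1      1              0
#> 2      2              1
#> 3      3              0
#> 4      4              1
#> 5      5              0
```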

Answer 2

Score: 1

Using reframe, check whether the name with the most recent date appears exactly once in its course:

library(dplyr) # version >= 1.1.0
df %>%
  reframe(new_registered = +(sum(name[which.max(date)] == name) == 1),
     .by = course)

-output

   course new_registered
1      1              0
2      2              1
3      3              0
4      4              1
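
One thing worth noting when comparing the two answers: for a course with only one registration, the count-based check returns 1, whereas the edited version of Answer 1 returns 0. The snippet below is a small base R sketch of that difference, using the single-row course 5 ("will") from Answer 1's edit; length(name) stands in for dplyr's n(), and the variable names answer1 and answer2 are made up for illustration.

```r
# Single-registration course, mirroring course 5 from Answer 1's edited data
name <- "will"
date <- as.Date("2023-02-10")

# Answer 2's expression: does the latest name occur exactly once in the course?
answer2 <- +(sum(name[which.max(date)] == name) == 1)

# Answer 1's edited expression, with length(name) standing in for n()
answer1 <- +(!(name[length(name)] %in% rev(name)[-1]) & length(name) > 1)

c(answer1 = answer1, answer2 = answer2)
#> answer1 answer2 
#>       0       1
```

Which behaviour is preferable depends on whether the first person ever registered in a course should count as new.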
