2023-08-08 22:13:33

Identify index presentations and re-attendances within a 28 day period

Question

My dataset records presentations to a location by an individual. This is tabulated below but included as a dput further down.

Identifier	date
"A1"	"28/01/2020"
"A1"	"01/04/2020"
"A1"	"16/08/2020"
"A1"	"20/08/2020"
"A1"	"30/08/2020"
"A1"	"31/10/2020"
"A1"	"14/11/2020"
"A1"	"26/11/2020"
"A1"	"25/12/2020"
"A1"	"04/05/2021"
"A1"	"08/05/2021"
"A1"	"26/07/2021"

The individual attends sporadically and on occasion returns several times within a 28-day period.

Working in date order, I want to identify index visits that occur at least 28 days apart and re-attendances that occur within the 28-day window.

The resulting data when processed correctly would look like this:

Identifier	date	index_visit	index_id
"A1"	"28/01/2020"	TRUE	1
"A1"	"01/04/2020"	TRUE	2
"A1"	"16/08/2020"	TRUE	3
"A1"	"20/08/2020"	FALSE	3
"A1"	"30/08/2020"	FALSE	3
"A1"	"31/10/2020"	TRUE	4
"A1"	"14/11/2020"	FALSE	4
"A1"	"26/11/2020"	FALSE	4
"A1"	"25/12/2020"	TRUE	5
"A1"	"04/05/2021"	TRUE	6
"A1"	"08/05/2021"	FALSE	6
"A1"	"26/07/2021"	TRUE	7

I would prefer solutions that make use of dplyr as this is what I am most familiar with.

Dput:

data <- structure(list(identifier = c("A1", "A1", "A1", "A1", "A1", "A1",
                                  "A1", "A1", "A1", "A1", "A1", "A1", "A1"),
                    date = structure(c(18520, 18504, 18621, 18580, 18353, 18751,
                                       18289, 18494, 18592, 18490, 18755, 18834, 18566),
                                     class = "Date")),
               row.names = c(NA, -13L), class = "data.frame")

Answer 1

Score: 1

With dplyr you can calculate the gap between consecutive dates with lag() and then number the index visits with a running total from cumsum():

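library(dplyr)
data %>%
  arrange(date) %>%
  mutate(
    # TRUE when more than 28 days have passed since the previous visit
    # (use >= 28 if a gap of exactly 28 days should also start a new index visit)
    index_visit = difftime(date, lag(date, default = as.Date("1900-01-01")), units = "days") > 28,
    # running count of index visits gives each episode its id
    index_id = cumsum(index_visit)
  )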
#    identifier       date index_visit index_id
# 1          A1 2020-01-28        TRUE        1
# 2          A1 2020-04-01        TRUE        2
# 3          A1 2020-08-16        TRUE        3
# 4          A1 2020-08-20       FALSE        3
# 5          A1 2020-08-30       FALSE        3
# 6          A1 2020-09-15       FALSE        3
# 7          A1 2020-10-31        TRUE        4
# 8          A1 2020-11-14       FALSE        4
# 9          A1 2020-11-26       FALSE        4
# 10         A1 2020-12-25        TRUE        5
# 11         A1 2021-05-04        TRUE        6
# 12         A1 2021-05-08       FALSE        6
# 13         A1 2021-07-26        TRUE        7

The first row has no previous value to lag from, so I used 1 January 1900 as a filler date to make the first visit an index visit.
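
The pipeline above treats the whole data frame as a single person. As a sketch only (not part of the original answer), one way to handle several identifiers is to group by identifier and flag the first row of each group with is.na() instead of a filler date. It uses >= 28 on the assumption that "at least 28 days apart" includes a gap of exactly 28 days; the names visits, gap_days and result are illustrative.

```r
library(dplyr)

# Same values as the dput in the question, rebuilt from the day counts
visits <- data.frame(
  identifier = "A1",
  date = as.Date(c(18520, 18504, 18621, 18580, 18353, 18751, 18289,
                   18494, 18592, 18490, 18755, 18834, 18566),
                 origin = "1970-01-01")
)

result <- visits %>%
  arrange(identifier, date) %>%
  group_by(identifier) %>%
  mutate(
    # days since the previous visit of the same identifier (NA on the first row)
    gap_days = as.numeric(difftime(date, lag(date), units = "days")),
    # the first visit, or any visit 28 or more days after the previous one
    index_visit = is.na(gap_days) | gap_days >= 28,
    # numbering restarts at 1 for each identifier
    index_id = cumsum(index_visit)
  ) %>%
  ungroup()

print(result)
```

Because the gap is measured against the previous visit, this gives the same index_id values as the output above for the sample data.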

Answer 2

Score: 1

My function time_episodes() was originally written to identify episodes of disease reinfection but is applicable to any episodic event analysis.

It also supports more complex time units like months through lubridate.

# Uncomment to install
# remotes::install_github("NicChr/timeplyr")
library(tidyverse)
library(timeplyr)
data <- as_tibble(data)
episodic_data <- data %>%
  group_by(identifier) %>%
  time_episodes(date, time_by = "day", window = 28, switch_on_boundary = TRUE)
episodic_data %>%
  arrange(identifier, date) %>%
  mutate(index_visit = ep_id_new > 0)
# A tibble: 13 × 7
# Groups:   identifier [1]
   identifier date       t_elapsed ep_start   ep_id ep_id_new index_visit
   <chr>      <date>         <dbl> <date>     <int>     <int> <lgl>      
 1 A1         2020-01-28         0 2020-01-28     1         1 TRUE       
 2 A1         2020-04-01        64 2020-04-01     2         2 TRUE       
 3 A1         2020-08-16       137 2020-08-16     3         3 TRUE       
 4 A1         2020-08-20         4 2020-08-16     3         0 FALSE      
 5 A1         2020-08-30        10 2020-08-16     3         0 FALSE      
 6 A1         2020-09-15        16 2020-08-16     3         0 FALSE      
 7 A1         2020-10-31        46 2020-10-31     4         4 TRUE       
 8 A1         2020-11-14        14 2020-10-31     4         0 FALSE      
 9 A1         2020-11-26        12 2020-10-31     4         0 FALSE      
10 A1         2020-12-25        29 2020-12-25     5         5 TRUE       
11 A1         2021-05-04       130 2021-05-04     6         6 TRUE       
12 A1         2021-05-08         4 2021-05-04     6         0 FALSE      
13 A1         2021-07-26        79 2021-07-26     7         7 TRUE

Alternative method

data %>%
  group_by(identifier) %>%
  arrange(identifier, date) %>%
  mutate(elapsed = time_elapsed(date, "days")) %>%
  mutate(index_visit = row_number() == 1 | elapsed >= 28)
# A tibble: 13 × 4
# Groups:   identifier [1]
   identifier date       elapsed index_visit
   <chr>      <date>       <dbl> <lgl>      
 1 A1         2020-01-28      NA TRUE       
 2 A1         2020-04-01      64 TRUE       
 3 A1         2020-08-16     137 TRUE       
 4 A1         2020-08-20       4 FALSE      
 5 A1         2020-08-30      10 FALSE      
 6 A1         2020-09-15      16 FALSE      
 7 A1         2020-10-31      46 TRUE       
 8 A1         2020-11-14      14 FALSE      
 9 A1         2020-11-26      12 FALSE      
10 A1         2020-12-25      29 TRUE       
11 A1         2021-05-04     130 TRUE       
12 A1         2021-05-08       4 FALSE      
13 A1         2021-07-26      79 TRUE
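
The outputs above measure each gap against the previous visit, so a chain of closely spaced visits can stretch one episode past 28 days: 2020-09-15 is 30 days after the index visit on 2020-08-16 but is still flagged FALSE. If the 28-day window should instead be anchored on the index visit itself (an assumption about the question's intent, not something either answer states), a small loop could do it. The helper flag_index_anchored() is hypothetical and written only for this sketch.

```r
library(dplyr)

# Hypothetical helper: `dates` must be sorted ascending.
# A visit is an index visit when it falls `window` or more days after
# the most recent index visit (not after the most recent visit).
flag_index_anchored <- function(dates, window = 28) {
  n <- length(dates)
  is_index <- logical(n)
  if (n == 0) return(is_index)
  anchor <- dates[1]
  is_index[1] <- TRUE
  for (i in seq_len(n)[-1]) {
    if (as.numeric(dates[i] - anchor) >= window) {
      is_index[i] <- TRUE
      anchor <- dates[i]  # the window restarts from this index visit
    }
  }
  is_index
}

# Same values as the dput in the question
visits <- data.frame(
  identifier = "A1",
  date = as.Date(c(18520, 18504, 18621, 18580, 18353, 18751, 18289,
                   18494, 18592, 18490, 18755, 18834, 18566),
                 origin = "1970-01-01")
)

anchored <- visits %>%
  arrange(identifier, date) %>%
  group_by(identifier) %>%
  mutate(
    index_visit = flag_index_anchored(date),
    index_id = cumsum(index_visit)
  ) %>%
  ungroup()

print(anchored)
```

Under this definition 2020-09-15 becomes its own index visit, so the sample data yields eight episodes rather than seven. Which definition is appropriate depends on how the 28-day window is meant to be applied.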
