识别在28天内的指标演示和再次出席。

huangapple go评论67阅读模式
英文:

Identify index presentations and re-attendances within a 28 day period

问题

我的数据集记录了个人对某个地点的演示。下面是表格形式的记录,但也包含了下方的dput格式。

标识符 日期
"A1" "28/01/2020"
"A1" "01/04/2020"
"A1" "16/08/2020"
"A1" "20/08/2020"
"A1" "30/08/2020"
"A1" "31/10/2020"
"A1" "14/11/2020"
"A1" "26/11/2020"
"A1" "25/12/2020"
"A1" "04/05/2021"
"A1" "08/05/2021"
"A1" "26/07/2021"

个人的出席是零散的,并且有时在28天内多次返回。

按日期顺序工作,我想找出至少相隔28天的首次访问和在28天窗口内再次访问的情况。

正确处理后的结果应该如下所示:

标识符 日期 首次访问 首次访问ID
"A1" "28/01/2020" TRUE 1
"A1" "01/04/2020" TRUE 2
"A1" "16/08/2020" TRUE 3
"A1" "20/08/2020" FALSE 3
"A1" "30/08/2020" FALSE 3
"A1" "31/10/2020" TRUE 4
"A1" "14/11/2020" FALSE 4
"A1" "26/11/2020" FALSE 4
"A1" "25/12/2020" TRUE 5
"A1" "04/05/2021" TRUE 6
"A1" "08/05/2021" FALSE 6
"A1" "26/07/2021" TRUE 7

我更喜欢使用dplyr来解决问题,因为我对它最熟悉。

Dput:

data <- structure(list(identifier = c("A1", "A1", "A1", "A1", "A1", "A1",
                                  "A1", "A1", "A1", "A1", "A1", "A1", "A1"),
                    date = structure(c(18520, 18504, 18621, 18580, 18353, 18751,
                                       18289, 18494, 18592, 18490, 18755, 18834, 18566),
                                     class = "Date")),
               row.names = c(NA, -13L), class = "data.frame")
英文:

My dataset records presentations to a location by an individual. This is tabulated below but included as a dput further down.

Identifier date
"A1" "28/01/2020"
"A1" "01/04/2020"
"A1" "16/08/2020"
"A1" "20/08/2020"
"A1" "30/08/2020"
"A1" "31/10/2020"
"A1" "14/11/2020"
"A1" "26/11/2020"
"A1" "25/12/2020"
"A1" "04/05/2021"
"A1" "08/05/2021"
"A1" "26/07/2021"

The individual attends sporadically and on occasions returns several times within a 28 day period.

Working in date order I want to identify index cases that occur at least 28 days apart and reattendances that occur within the 28 day window.

The resulting data when processed correctly would look like this:

Identifier date index_visit index_id
"A1" "28/01/2020" TRUE 1
"A1" "01/04/2020" TRUE 2
"A1" "16/08/2020" TRUE 3
"A1" "20/08/2020" FALSE 3
"A1" "30/08/2020" FALSE 3
"A1" "31/10/2020" TRUE 4
"A1" "14/11/2020" FALSE 4
"A1" "26/11/2020" FALSE 4
"A1" "25/12/2020" TRUE 5
"A1" "04/05/2021" TRUE 6
"A1" "08/05/2021" FALSE 6
"A1" "26/07/2021" TRUE 7

I would prefer solutions that make use of dplyr as this is what I am most familiar with.

Dput:

data <- structure(list(identifier = c("A1", "A1", "A1", "A1", "A1", "A1",
                                  "A1", "A1", "A1", "A1", "A1", "A1", "A1"),
                    date = structure(c(18520, 18504, 18621, 18580, 18353, 18751,
                                       18289, 18494, 18592, 18490, 18755, 18834, 18566),
                                     class = "Date")),
               row.names = c(NA, -13L), class = "data.frame")

答案1

得分: 1

使用dplyr包,你可以使用lag()函数计算日期差异,然后使用cumsum函数累加变化的次数。

library(dplyr)
data %>%
  arrange(date) %>%
  mutate(
    index_visit = difftime(date, lag(default = as.Date("1900-01-01"), date), units = "day") > 28,
    index_id = cumsum(index_visit)
  )
#    identifier       date index_visit index_id
# 1          A1 2020-01-28        TRUE        1
# 2          A1 2020-04-01        TRUE        2
# 3          A1 2020-08-16        TRUE        3
# 4          A1 2020-08-20       FALSE        3
# 5          A1 2020-08-30       FALSE        3
# 6          A1 2020-09-15       FALSE        3
# 7          A1 2020-10-31        TRUE        4
# 8          A1 2020-11-14       FALSE        4
# 9          A1 2020-11-26       FALSE        4
# 10         A1 2020-12-25        TRUE        5
# 11         A1 2021-05-04        TRUE        6
# 12         A1 2021-05-08       FALSE        6
# 13         A1 2021-07-26        TRUE        7

因此,第一行没有前一个值,我使用了1900年1月1日作为填充值,以便第一个日期成为索引访问。

英文:

With dplyr you can calculate the date difference with lag() and then sum up the number of times it changes with cumsum

library(dplyr)
data %>% 
  arrange(date) %>% 
  mutate(
    index_visit=difftime(date, lag(default = as.Date("1900-01-01"), date), units = "day")>28,
    index_id = cumsum(index_visit)
  )
#    identifier       date index_visit index_id
# 1          A1 2020-01-28        TRUE        1
# 2          A1 2020-04-01        TRUE        2
# 3          A1 2020-08-16        TRUE        3
# 4          A1 2020-08-20       FALSE        3
# 5          A1 2020-08-30       FALSE        3
# 6          A1 2020-09-15       FALSE        3
# 7          A1 2020-10-31        TRUE        4
# 8          A1 2020-11-14       FALSE        4
# 9          A1 2020-11-26       FALSE        4
# 10         A1 2020-12-25        TRUE        5
# 11         A1 2021-05-04        TRUE        6
# 12         A1 2021-05-08       FALSE        6
# 13         A1 2021-07-26        TRUE        7

So the first row doesn't have a lag value, I used Jan 1 1900 as a filler so the first date would be an index visit

答案2

得分: 1

我的函数time_episodes()最初是用来识别疾病再感染的情况,但适用于任何分段事件分析。

它还通过lubridate支持更复杂的时间单位,如月份。

# Uncomment to install
# remotes::install_github("NicChr/timeplyr")
library(tidyverse)
library(timeplyr)

data <- as_tibble(data)

episodic_data <- data %>%
  group_by(identifier) %>%
  time_episodes(date, time_by = "day", window = 28, switch_on_boundary = TRUE)

episodic_data %>%
  arrange(identifier, date) %>%
  mutate(index_visit = ep_id_new > 0)
# A tibble: 13 × 7
# Groups:   identifier [1]
   identifier date       t_elapsed ep_start   ep_id ep_id_new index_visit
   <chr>      <date>         <dbl> <date>     <int>     <int> <lgl>      
 1 A1         2020-01-28         0 2020-01-28     1         1 TRUE       
 2 A1         2020-04-01        64 2020-04-01     2         2 TRUE       
 3 A1         2020-08-16       137 2020-08-16     3         3 TRUE       
 4 A1         2020-08-20         4 2020-08-16     3         0 FALSE      
 5 A1         2020-08-30        10 2020-08-16     3         0 FALSE      
 6 A1         2020-09-15        16 2020-08-16     3         0 FALSE      
 7 A1         2020-10-31        46 2020-10-31     4         4 TRUE       
 8 A1         2020-11-14        14 2020-10-31     4         0 FALSE      
 9 A1         2020-11-26        12 2020-10-31     4         0 FALSE      
10 A1         2020-12-25        29 2020-12-25     5         5 TRUE       
11 A1         2021-05-04       130 2021-05-04     6         6 TRUE       
12 A1         2021-05-08         4 2021-05-04     6         0 FALSE      
13 A1         2021-07-26        79 2021-07-26     7         7 TRUE 

替代方法

data %>%
  group_by(identifier) %>%
  arrange(identifier, date) %>%
  mutate(elapsed = time_elapsed(date, "days")) %>%
  mutate(index_visit = row_number() == 1 | elapsed >= 28)
# A tibble: 13 × 4
# Groups:   identifier [1]
   identifier date       elapsed index_visit
   <chr>      <date>       <dbl> <lgl>      
 1 A1         2020-01-28      NA TRUE       
 2 A1         2020-04-01      64 TRUE       
 3 A1         2020-08-16     137 TRUE       
 4 A1         2020-08-20       4 FALSE      
 5 A1         2020-08-30      10 FALSE      
 6 A1         2020-09-15      16 FALSE      
 7 A1         2020-10-31      46 TRUE       
 8 A1         2020-11-14      14 FALSE      
 9 A1         2020-11-26      12 FALSE      
10 A1         2020-12-25      29 TRUE       
11 A1         2021-05-04     130 TRUE       
12 A1         2021-05-08       4 FALSE      
13 A1         2021-07-26      79 TRUE   
英文:

My function time_episodes() was originally written to identify episodes of disease reinfection but is applicable to any episodic event analysis.

It also supports more complex time units like months through lubridate.

# Uncomment to install
# remotes::install_github(&quot;NicChr/timeplyr&quot;)
library(tidyverse)
library(timeplyr)

data &lt;- as_tibble(data)

episodic_data &lt;- data %&gt;%
  group_by(identifier) %&gt;%
  time_episodes(date, time_by = &quot;day&quot;, window = 28, switch_on_boundary = TRUE)

episodic_data %&gt;%
  arrange(identifier, date) %&gt;%
  mutate(index_visit = ep_id_new &gt; 0)
# A tibble: 13 &#215; 7
# Groups:   identifier [1]
   identifier date       t_elapsed ep_start   ep_id ep_id_new index_visit
   &lt;chr&gt;      &lt;date&gt;         &lt;dbl&gt; &lt;date&gt;     &lt;int&gt;     &lt;int&gt; &lt;lgl&gt;      
 1 A1         2020-01-28         0 2020-01-28     1         1 TRUE       
 2 A1         2020-04-01        64 2020-04-01     2         2 TRUE       
 3 A1         2020-08-16       137 2020-08-16     3         3 TRUE       
 4 A1         2020-08-20         4 2020-08-16     3         0 FALSE      
 5 A1         2020-08-30        10 2020-08-16     3         0 FALSE      
 6 A1         2020-09-15        16 2020-08-16     3         0 FALSE      
 7 A1         2020-10-31        46 2020-10-31     4         4 TRUE       
 8 A1         2020-11-14        14 2020-10-31     4         0 FALSE      
 9 A1         2020-11-26        12 2020-10-31     4         0 FALSE      
10 A1         2020-12-25        29 2020-12-25     5         5 TRUE       
11 A1         2021-05-04       130 2021-05-04     6         6 TRUE       
12 A1         2021-05-08         4 2021-05-04     6         0 FALSE      
13 A1         2021-07-26        79 2021-07-26     7         7 TRUE 

Alternative method

data %&gt;%
  group_by(identifier) %&gt;%
  arrange(identifier, date) %&gt;%
  mutate(elapsed = time_elapsed(date, &quot;days&quot;)) %&gt;%
  mutate(index_visit = row_number() == 1 | elapsed &gt;= 28)
# A tibble: 13 &#215; 4
# Groups:   identifier [1]
   identifier date       elapsed index_visit
   &lt;chr&gt;      &lt;date&gt;       &lt;dbl&gt; &lt;lgl&gt;      
 1 A1         2020-01-28      NA TRUE       
 2 A1         2020-04-01      64 TRUE       
 3 A1         2020-08-16     137 TRUE       
 4 A1         2020-08-20       4 FALSE      
 5 A1         2020-08-30      10 FALSE      
 6 A1         2020-09-15      16 FALSE      
 7 A1         2020-10-31      46 TRUE       
 8 A1         2020-11-14      14 FALSE      
 9 A1         2020-11-26      12 FALSE      
10 A1         2020-12-25      29 TRUE       
11 A1         2021-05-04     130 TRUE       
12 A1         2021-05-08       4 FALSE      
13 A1         2021-07-26      79 TRUE   

huangapple
  • 本文由 发表于 2023年8月8日 22:13:33
  • 转载请务必保留本文链接:https://go.coder-hub.com/76860414.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定