Identify index presentations and re-attendances within a 28 day period
Question
My dataset records presentations to a location by an individual. This is tabulated below but included as a dput further down.
Identifier | date |
---|---|
"A1" | "28/01/2020" |
"A1" | "01/04/2020" |
"A1" | "16/08/2020" |
"A1" | "20/08/2020" |
"A1" | "30/08/2020" |
"A1" | "31/10/2020" |
"A1" | "14/11/2020" |
"A1" | "26/11/2020" |
"A1" | "25/12/2020" |
"A1" | "04/05/2021" |
"A1" | "08/05/2021" |
"A1" | "26/07/2021" |
The individual attends sporadically and on occasion returns several times within a 28-day period.
Working in date order, I want to identify index visits that occur at least 28 days apart and re-attendances that occur within the 28-day window.
The resulting data when processed correctly would look like this:
Identifier | date | index_visit | index_id |
---|---|---|---|
"A1" | "28/01/2020" | TRUE | 1 |
"A1" | "01/04/2020" | TRUE | 2 |
"A1" | "16/08/2020" | TRUE | 3 |
"A1" | "20/08/2020" | FALSE | 3 |
"A1" | "30/08/2020" | FALSE | 3 |
"A1" | "31/10/2020" | TRUE | 4 |
"A1" | "14/11/2020" | FALSE | 4 |
"A1" | "26/11/2020" | FALSE | 4 |
"A1" | "25/12/2020" | TRUE | 5 |
"A1" | "04/05/2021" | TRUE | 6 |
"A1" | "08/05/2021" | FALSE | 6 |
"A1" | "26/07/2021" | TRUE | 7 |
I would prefer solutions that make use of dplyr as this is what I am most familiar with.
Dput:
data <- structure(list(identifier = c("A1", "A1", "A1", "A1", "A1", "A1",
                                      "A1", "A1", "A1", "A1", "A1", "A1", "A1"),
                       date = structure(c(18520, 18504, 18621, 18580, 18353, 18751,
                                          18289, 18494, 18592, 18490, 18755, 18834, 18566),
                                        class = "Date")),
                  row.names = c(NA, -13L), class = "data.frame")
Answer 1
Score: 1
With dplyr you can calculate the difference between each date and the previous one with lag(), and then number the episodes by counting the index visits so far with cumsum().
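library(dplyr)

data %>%
  arrange(date) %>%  # with several identifiers, arrange(identifier, date) and group_by(identifier) first
  mutate(
    # days since the previous visit; the 1900 default makes the first row an index visit
    # >= 28 because index visits are "at least 28 days apart"
    index_visit = difftime(date, lag(date, default = as.Date("1900-01-01")), units = "days") >= 28,
    # running count of index visits gives the episode number
    index_id = cumsum(index_visit)
  )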
#    identifier       date index_visit index_id
# 1          A1 2020-01-28        TRUE        1
# 2          A1 2020-04-01        TRUE        2
# 3          A1 2020-08-16        TRUE        3
# 4          A1 2020-08-20       FALSE        3
# 5          A1 2020-08-30       FALSE        3
# 6          A1 2020-09-15       FALSE        3
# 7          A1 2020-10-31        TRUE        4
# 8          A1 2020-11-14       FALSE        4
# 9          A1 2020-11-26       FALSE        4
# 10         A1 2020-12-25        TRUE        5
# 11         A1 2021-05-04        TRUE        6
# 12         A1 2021-05-08       FALSE        6
# 13         A1 2021-07-26        TRUE        7
Because the first row has no previous value to lag from, I used 1 January 1900 as a filler so that the first date is always flagged as an index visit.
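One thing worth checking is what the 28 days are measured from. The lag() approach compares each visit with the previous visit, so a chain of closely spaced visits stays in one episode however long it runs. The dput contains a visit on 2020-09-15 that is not in the question's table: it is 16 days after the previous visit but 30 days after the index visit on 2020-08-16. The question does not say which rule is intended. If the window should instead run from the index visit itself, a minimal sketch could look like the following. The helper flag_index_fixed() and the data frame name visits are hypothetical names introduced here for illustration, and the dates are the dput values written out as calendar dates.

```r
library(dplyr)

# Hypothetical helper (not from the answer above): a visit is an index visit
# when it falls at least `window` days after the current index visit.
# Assumes `dates` is sorted ascending and contains no NA.
flag_index_fixed <- function(dates, window = 28) {
  is_index <- logical(length(dates))
  anchor <- as.Date(NA)  # date of the current index visit
  for (i in seq_along(dates)) {
    if (is.na(anchor) || as.numeric(dates[i] - anchor) >= window) {
      is_index[i] <- TRUE
      anchor <- dates[i]  # a new 28-day window starts here
    }
  }
  is_index
}

# The 13 visits from the dput in the question, as calendar dates
visits <- data.frame(
  identifier = "A1",
  date = as.Date(c("2020-01-28", "2020-04-01", "2020-08-16", "2020-08-20",
                   "2020-08-30", "2020-09-15", "2020-10-31", "2020-11-14",
                   "2020-11-26", "2020-12-25", "2021-05-04", "2021-05-08",
                   "2021-07-26"))
)

fixed_window <- visits %>%
  arrange(identifier, date) %>%
  group_by(identifier) %>%
  mutate(
    index_visit = flag_index_fixed(date),
    index_id    = cumsum(index_visit)
  ) %>%
  ungroup()
```

Under this rule 2020-09-15 becomes an index visit of its own, so the sample data yields eight episodes rather than seven. For the twelve rows shown in the question's table the two rules give the same flags, so the choice only matters for chains like the August to September one.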
Answer 2
Score: 1
My function time_episodes() was originally written to identify episodes of disease reinfection but is applicable to any episodic event analysis.
It also supports more complex time units like months through lubridate.
# Uncomment to install
# remotes::install_github("NicChr/timeplyr")
library(tidyverse)
library(timeplyr)

data <- as_tibble(data)

episodic_data <- data %>%
  group_by(identifier) %>%
  time_episodes(date, time_by = "day", window = 28, switch_on_boundary = TRUE)

episodic_data %>%
  arrange(identifier, date) %>%
  mutate(index_visit = ep_id_new > 0)
# A tibble: 13 × 7
# Groups:   identifier [1]
#    identifier date       t_elapsed ep_start   ep_id ep_id_new index_visit
#    <chr>      <date>         <dbl> <date>     <int>     <int> <lgl>
#  1 A1         2020-01-28         0 2020-01-28     1         1 TRUE
#  2 A1         2020-04-01        64 2020-04-01     2         2 TRUE
#  3 A1         2020-08-16       137 2020-08-16     3         3 TRUE
#  4 A1         2020-08-20         4 2020-08-16     3         0 FALSE
#  5 A1         2020-08-30        10 2020-08-16     3         0 FALSE
#  6 A1         2020-09-15        16 2020-08-16     3         0 FALSE
#  7 A1         2020-10-31        46 2020-10-31     4         4 TRUE
#  8 A1         2020-11-14        14 2020-10-31     4         0 FALSE
#  9 A1         2020-11-26        12 2020-10-31     4         0 FALSE
# 10 A1         2020-12-25        29 2020-12-25     5         5 TRUE
# 11 A1         2021-05-04       130 2021-05-04     6         6 TRUE
# 12 A1         2021-05-08         4 2021-05-04     6         0 FALSE
# 13 A1         2021-07-26        79 2021-07-26     7         7 TRUE
Alternative method
data %>%
  group_by(identifier) %>%
  arrange(identifier, date) %>%
  mutate(elapsed = time_elapsed(date, "days")) %>%
  mutate(index_visit = row_number() == 1 | elapsed >= 28)
# A tibble: 13 × 4
# Groups:   identifier [1]
#    identifier date       elapsed index_visit
#    <chr>      <date>       <dbl> <lgl>
#  1 A1         2020-01-28      NA TRUE
#  2 A1         2020-04-01      64 TRUE
#  3 A1         2020-08-16     137 TRUE
#  4 A1         2020-08-20       4 FALSE
#  5 A1         2020-08-30      10 FALSE
#  6 A1         2020-09-15      16 FALSE
#  7 A1         2020-10-31      46 TRUE
#  8 A1         2020-11-14      14 FALSE
#  9 A1         2020-11-26      12 FALSE
# 10 A1         2020-12-25      29 TRUE
# 11 A1         2021-05-04     130 TRUE
# 12 A1         2021-05-08       4 FALSE
# 13 A1         2021-07-26      79 TRUE
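The alternative method stops at index_visit and relies on timeplyr for the elapsed time. As a rough dplyr-only sketch of the same gap-from-previous-visit rule, the block below also numbers the episodes separately for each identifier. The data frame name visits and the two "B2" rows are made up for illustration: they are not in the question's data and are there only to show the per-identifier restart and the boundary case of a gap of exactly 28 days.

```r
library(dplyr)

visits <- data.frame(
  # 13 "A1" visits from the question's dput, plus two made-up "B2" visits
  identifier = c(rep("A1", 13), rep("B2", 2)),
  date = as.Date(c("2020-01-28", "2020-04-01", "2020-08-16", "2020-08-20",
                   "2020-08-30", "2020-09-15", "2020-10-31", "2020-11-14",
                   "2020-11-26", "2020-12-25", "2021-05-04", "2021-05-08",
                   "2021-07-26",
                   "2020-02-01", "2020-02-29"))  # hypothetical: exactly 28 days apart
)

flagged <- visits %>%
  arrange(identifier, date) %>%
  group_by(identifier) %>%
  mutate(
    gap_days    = as.numeric(date - lag(date)),   # NA on each person's first visit
    index_visit = is.na(gap_days) | gap_days >= 28,
    index_id    = cumsum(index_visit)             # restarts at 1 for every identifier
  ) %>%
  ungroup()
```

For "A1" this reproduces the index_id values 1 to 7 from the outputs above. The second "B2" visit is flagged as an index visit because a gap of exactly 28 days satisfies "at least 28 days apart".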