Complete missing, non-overlapping date ranges, truncated, variable start/end dates, indicator var
<details>
<summary>Question</summary>
My question is a variation on https://stackoverflow.com/questions/76015556/complete-missing-dates-based-on-start-and-end and https://stackoverflow.com/questions/76393814/complete-missing-non-overlapping-date-ranges. Here is my starting data:

```R
have <- data.frame(ID = c(1, 1),
                   date1 = as.Date(c("1999-09-01", "2000-01-29")),
                   date2 = as.Date(c("2000-01-15", "2001-04-30")),
                   start = as.Date(c("2000-01-01", "2000-01-01")),
                   end = as.Date(c("2001-07-31", "2001-07-31")),
                   ind = c(1, 1))
```

```
> have
  ID      date1      date2      start        end ind
1  1 1999-09-01 2000-01-15 2000-01-01 2001-07-31   1
2  1 2000-01-29 2001-04-30 2000-01-01 2001-07-31   1
```

My desired output looks like this:

```R
want <- data.frame(ID = c(1, 1, 1, 1),
                   date1 = as.Date(c("2000-01-01", "2000-01-16", "2000-01-29", "2001-05-01")),
                   date2 = as.Date(c("2000-01-15", "2000-01-28", "2001-04-30", "2001-07-31")),
                   ind = c(1, 0, 1, 0))
```

```
> want
  ID      date1      date2 ind
1  1 2000-01-01 2000-01-15   1
2  1 2000-01-16 2000-01-28   0
3  1 2000-01-29 2001-04-30   1
4  1 2001-05-01 2001-07-31   0
```

What I'm trying to do is:
1. Add lines with the missing date intervals between a floor and ceiling date (my previous post)
2. Use date floor and ceiling which vary by group ID
3. Exclude date ranges before the given floor date
4. Create an indicator variable `ind` keeping track of date ranges present in the starting data (time when feature is observed = 1) or not (time when feature is not observed = 0)
So far I have this code from the linked posts but the output is not quite what I'm going for:

```R
library(dplyr)

ex <- have %>%
  group_by(ID) %>%
  summarize(newdate1 = c(start, date2 + 1), newdate2 = c(date1 - 1, end)) %>%
  rename(date1 = newdate1, date2 = newdate2) %>%
  filter(date2 > date1) %>%
  ungroup() %>%
  bind_rows(have) %>%
  arrange(ID, date1)
```

```
> ex
# A tibble: 5 × 6
     ID date1      date2      start      end          ind
  <dbl> <date>     <date>     <date>     <date>     <dbl>
1     1 1999-09-01 2000-01-15 2000-01-01 2001-07-31     1
2     1 2000-01-01 2000-01-28 NA         NA            NA
3     1 2000-01-16 2001-07-31 NA         NA            NA
4     1 2000-01-29 2001-04-30 2000-01-01 2001-07-31     1
5     1 2001-05-01 2001-07-31 NA         NA            NA
```

Thank you.
**Update:**
Recoding the intervals so they fall within the floor and ceiling dates is simple enough:

```R
have2 <- have %>%
  group_by(ID) %>%
  mutate(date1new = ifelse(date1 < start & date2 > start, start, date1),
         date2new = ifelse(date2 > end & date1 < end, end, date2),
         date1new = as.Date(date1new, origin = "1970-01-01"),
         date2new = as.Date(date2new, origin = "1970-01-01")) %>%
  select(ID, (start:date2new)) %>%
  rename(date1 = date1new, date2 = date2new) %>%
  ungroup()
```

```
> have2
# A tibble: 2 × 6
     ID start      end          ind date1      date2
  <dbl> <date>     <date>     <dbl> <date>     <date>
1     1 2000-01-01 2001-07-31     1 2000-01-01 2000-01-15
2     1 2000-01-01 2001-07-31     1 2000-01-29 2001-04-30
```

I still can't get my desired output using the example script.

```R
ex2 <- have2 %>%
  group_by(ID) %>%
  summarize(newdate1 = c(start, date2 + 1), newdate2 = c(date1 - 1, end)) %>%
  rename(date1 = newdate1, date2 = newdate2) %>%
  filter(date2 > date1) %>%
  ungroup() %>%
  bind_rows(have2) %>%
  arrange(ID, date1)
```

```
> ex2
# A tibble: 5 × 6
     ID date1      date2      start      end          ind
  <dbl> <date>     <date>     <date>     <date>     <dbl>
1     1 2000-01-01 2000-01-28 NA         NA            NA
2     1 2000-01-01 2000-01-15 2000-01-01 2001-07-31     1
3     1 2000-01-16 2001-07-31 NA         NA            NA
4     1 2000-01-29 2001-04-30 2000-01-01 2001-07-31     1
5     1 2001-05-01 2001-07-31 NA         NA            NA
```

</details>
# Answer 1
**Score**: 0
You almost got it. The main mistake was passing `start` and `end` as whole columns, not as single values, in

```R
newdate1 = c(start, date2+1), newdate2 = c(date1-1, end)
```

This way, `c()` prepends the whole `start` column (one copy per row of the group), not just one value. That's why your output had extra rows. You wanted to do something like `start[1]`.
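To see why the column-versus-value distinction changes the row count, here is a small base-R check using the dates from the question. The names `whole_column` and `first_value` are made up for illustration.

```R
# Dates for ID 1, copied from the question's `have` data frame
start <- as.Date(c("2000-01-01", "2000-01-01"))
date2 <- as.Date(c("2000-01-15", "2001-04-30"))

# Passing the whole column prepends one floor date per row of the group
whole_column <- c(start, date2 + 1)
length(whole_column)   # 4 candidate gap starts for only 2 intervals

# Passing a single value prepends the floor date exactly once
first_value <- c(start[1], date2 + 1)
length(first_value)    # 3 candidate gap starts: floor + one per interval
```

With n intervals there are at most n + 1 gaps (before the first, between neighbours, after the last), which is why the single-value form gives the right number of candidate rows.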
Besides that, we can make some minor improvements:

1. Create an `ind = 0` column in the `summarize`;
2. Make your `mutate` more compact;
3. Entirely remove rows where both dates fall before the `start` or after the `end` with a filter;
4. Deselect the `start` and `end` columns before the `bind_rows()`.

```R
have2 <- have %>%
  group_by(ID) %>%
  # Item 3: drop intervals entirely before `start` or entirely after `end`.
  # The outer parentheses matter: `!a | b` parses as `(!a) | b`.
  filter(!((date1 < start & date2 < start) | (date1 > end & date2 > end))) %>%
  # Item 2: clip the remaining intervals to the window
  mutate(date1 = ifelse(date1 < start, start, date1) %>% as.Date(origin = "1970-01-01"),
         date2 = ifelse(date2 > end, end, date2) %>% as.Date(origin = "1970-01-01"))

# You don't necessarily need to ungroup, as we want grouped data for the summarize
# (dplyr >= 1.1.0 recommends reframe() for multi-row results; summarize() warns)
have2 %>%
  summarize(newdate1 = c(start[1], date2 + 1),
            newdate2 = c(date1 - 1, end[1]),
            ind = 0) %>%                              # Item 1
  rename(date1 = newdate1, date2 = newdate2) %>%
  filter(date2 > date1) %>%
  ungroup() %>%
  bind_rows(select(have2, -c(start, end))) %>%        # Item 4
  arrange(ID, date1)
```

Result:

```
# A tibble: 4 × 4
     ID date1      date2        ind
  <dbl> <date>     <date>     <dbl>
1     1 2000-01-01 2000-01-15     1
2     1 2000-01-16 2000-01-28     0
3     1 2000-01-29 2001-04-30     1
4     1 2001-05-01 2001-07-31     0
```

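For comparison, the same steps can be sketched without dplyr. This is a base-R variant, not the pipeline above. The function name `fill_gaps` is hypothetical, and the sketch assumes one `start`/`end` pair per ID, non-overlapping intervals, and at least one interval overlapping each ID's window. It keeps a gap when `date1 <= date2`, so a gap lasting a single day survives (a `date2 > date1` filter would drop a gap that starts and ends on the same day).

```R
have <- data.frame(ID = c(1, 1),
                   date1 = as.Date(c("1999-09-01", "2000-01-29")),
                   date2 = as.Date(c("2000-01-15", "2001-04-30")),
                   start = as.Date(c("2000-01-01", "2000-01-01")),
                   end = as.Date(c("2001-07-31", "2001-07-31")),
                   ind = c(1, 1))

# Hypothetical helper: clip observed intervals to each ID's window, add the gaps
fill_gaps <- function(df) {
  pieces <- lapply(split(df, df$ID), function(g) {
    id <- g$ID[1]
    lo <- g$start[1]   # assumption: a single floor date per ID
    hi <- g$end[1]     # assumption: a single ceiling date per ID

    # Keep only intervals that overlap the window, sorted by start date
    g <- g[g$date2 >= lo & g$date1 <= hi, ]
    g <- g[order(g$date1), ]

    # Observed intervals, clipped; pmax()/pmin() keep the Date class
    obs <- data.frame(ID = rep(id, nrow(g)),
                      date1 = pmax(g$date1, lo),
                      date2 = pmin(g$date2, hi),
                      ind = rep(1, nrow(g)))

    # Candidate gaps: floor (or day after an interval) up to the day
    # before the next interval (or the ceiling)
    gap <- data.frame(ID = id,
                      date1 = c(lo, obs$date2 + 1),
                      date2 = c(obs$date1 - 1, hi),
                      ind = 0)
    gap <- gap[gap$date1 <= gap$date2, ]   # drop empty candidates

    out <- rbind(obs, gap)
    out[order(out$date1), ]
  })
  res <- do.call(rbind, pieces)
  rownames(res) <- NULL
  res
}

res <- fill_gaps(have)
res
```

On the example data this should give the same four rows as `want`, with `ind` alternating 1, 0, 1, 0.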
Alternative `mutate`:

If you know that your data is always sorted by increasing date within each ID, you only need to change the first value of `date1` and the last value of `date2`:

```R
mutate(date1 = c(if (date1[1] < start[1]) start[1] else date1[1], date1[-1]),
       date2 = c(date2[-n()], if (date2[n()] > end[1]) end[1] else date2[n()]))
```

This way you avoid unnecessary checks and don't lose the date formatting.
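The point about date formatting can be checked in base R. The sketch below uses the question's first `date1` column and a made-up variable `floor_date`; it shows that `ifelse()` returns plain numbers, while `pmax()` (offered here as another option, not something from the code above) keeps the `Date` class.

```R
date1 <- as.Date(c("1999-09-01", "2000-01-29"))
floor_date <- as.Date("2000-01-01")   # hypothetical name for the floor date

# ifelse() drops attributes, so the result is day counts since 1970-01-01
raw <- ifelse(date1 < floor_date, floor_date, date1)
class(raw)       # "numeric", hence the as.Date(origin = ...) step

# pmax() compares element-wise and keeps the Date class
clipped <- pmax(date1, floor_date)
class(clipped)   # "Date"
```

The ceiling side works the same way with `pmin(date2, end)`.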