Complete missing, non-overlapping date ranges, truncated, variable start/end dates, indicator var


Question

My question is a variation on https://stackoverflow.com/questions/76015556/complete-missing-dates-based-on-start-and-end and https://stackoverflow.com/questions/76393814/complete-missing-non-overlapping-date-ranges. Here is my starting data:

have <- data.frame(ID = c(1, 1),
                 date1 = as.Date(c("1999-09-01", "2000-01-29")),
                 date2 = as.Date(c("2000-01-15", "2001-04-30")),
                 start = as.Date(c("2000-01-01", "2000-01-01")),
                 end = as.Date(c("2001-07-31", "2001-07-31")),
                 ind = c(1,1))

> have
  ID      date1      date2      start        end ind
1  1 1999-09-01 2000-01-15 2000-01-01 2001-07-31   1
2  1 2000-01-29 2001-04-30 2000-01-01 2001-07-31   1

My desired output looks like this:

want <- data.frame(ID = c(1, 1, 1, 1),
                 date1 = as.Date(c("2000-01-01", "2000-01-16", "2000-01-29", "2001-05-01")),
                 date2 = as.Date(c("2000-01-15", "2000-01-28", "2001-04-30", "2001-07-31")),
                 ind = c(1,0,1,0))

> want
  ID      date1      date2 ind
1  1 2000-01-01 2000-01-15   1
2  1 2000-01-16 2000-01-28   0
3  1 2000-01-29 2001-04-30   1
4  1 2001-05-01 2001-07-31   0

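What I'm trying to do is:

  1. Add rows for the missing date intervals between a floor date and a ceiling date (my previous post).
  2. Use floor and ceiling dates that vary by group ID.
  3. Exclude date ranges before the given floor date.
  4. Create an indicator variable ind that tracks whether a date range is present in the starting data (time when the feature is observed = 1) or not (time when the feature is not observed = 0).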

So far I have this code from the linked posts, but the output is not quite what I'm going for:

library(dplyr)
ex <- have %>%
  group_by(ID) %>%
  summarize(newdate1 = c(start, date2+1), newdate2 = c(date1-1, end)) %>%
  rename(date1 = newdate1, date2 = newdate2) %>%
  filter(date2 > date1) %>%
  ungroup() %>%
  bind_rows(have) %>%
  arrange(ID, date1)

> ex
# A tibble: 5 × 6
     ID date1      date2      start      end          ind
  <dbl> <date>     <date>     <date>     <date>     <dbl>
1     1 1999-09-01 2000-01-15 2000-01-01 2001-07-31     1
2     1 2000-01-01 2000-01-28 NA         NA            NA
3     1 2000-01-16 2001-07-31 NA         NA            NA
4     1 2000-01-29 2001-04-30 2000-01-01 2001-07-31     1
5     1 2001-05-01 2001-07-31 NA         NA            NA

Thank you.

**Update:**
Recoding the intervals so they fall within the floor and ceiling dates is simple enough:

have2 <- have %>%
  group_by(ID) %>%
  mutate(date1new = ifelse(date1 < start & date2 > start, start, date1),
         date2new = ifelse(date2 > end & date1 < end, end, date2),
         date1new = as.Date(date1new, origin = "1970-01-01"),
         date2new = as.Date(date2new, origin = "1970-01-01")) %>%
  select(ID, (start:date2new)) %>%
  rename(date1 = date1new, date2 = date2new) %>%
  ungroup()

> have2
# A tibble: 2 × 6
     ID start      end          ind date1      date2     
  <dbl> <date>     <date>     <dbl> <date>     <date>    
1     1 2000-01-01 2001-07-31     1 2000-01-01 2000-01-15
2     1 2000-01-01 2001-07-31     1 2000-01-29 2001-04-30

I still can't get my desired output using the example script.

ex2 <- have2 %>%
  group_by(ID) %>%
  summarize(newdate1 = c(start, date2+1), newdate2 = c(date1-1, end)) %>%
  rename(date1 = newdate1, date2 = newdate2) %>%
  filter(date2 > date1) %>%
  ungroup() %>%
  bind_rows(have2) %>%
  arrange(ID, date1)

> ex2
# A tibble: 5 × 6
     ID date1      date2      start      end          ind
  <dbl> <date>     <date>     <date>     <date>     <dbl>
1     1 2000-01-01 2000-01-28 NA         NA            NA
2     1 2000-01-01 2000-01-15 2000-01-01 2001-07-31     1
3     1 2000-01-16 2001-07-31 NA         NA            NA
4     1 2000-01-29 2001-04-30 2000-01-01 2001-07-31     1
5     1 2001-05-01 2001-07-31 NA         NA            NA



# Answer 1
**Score**: 0


You almost got it. The main mistake was passing `start` and `end` as whole columns rather than single values in

newdate1 = c(start, date2+1), newdate2 = c(date1-1, end)

Within each group, this prepends the entire `start` column (one value per row) instead of a single floor date, so the candidate starts and ends no longer pair up. That is why your output had extra, overlapping rows. What you want is a single value, such as `start[1]`.
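To make the mismatch concrete, here is a minimal base R sketch that builds the candidate gap boundaries both ways. The vectors mirror the clamped data for ID 1 from the question; the names `wrong` and `right` are only illustrative.

```r
# Clamped intervals and window for ID 1 (values taken from the question).
date1 <- as.Date(c("2000-01-01", "2000-01-29"))
date2 <- as.Date(c("2000-01-15", "2001-04-30"))
start <- as.Date(c("2000-01-01", "2000-01-01"))
end   <- as.Date(c("2001-07-31", "2001-07-31"))

# Whole columns: four candidates, and the starts are paired with the wrong ends.
wrong <- data.frame(date1 = c(start, date2 + 1),
                    date2 = c(date1 - 1, end))

# Single values: three candidates (before the first interval, between the
# two intervals, after the last one), each start paired with its own end.
right <- data.frame(date1 = c(start[1], date2 + 1),
                    date2 = c(date1 - 1, end[1]))

nrow(wrong)  # 4
nrow(right)  # 3
```

In `wrong`, the second row runs from 2000-01-01 to 2000-01-28 and the third from 2000-01-16 to 2001-07-31, which are exactly the spurious rows in the `ex2` output.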

Besides that, we can make some minor improvements:

  1. Create an `ind = 0` column in the `summarize()`;
  2. Make your `mutate()` more compact;
  3. Use a `filter()` to remove rows that fall entirely before `start` or entirely after `end`;
  4. Deselect the `start` and `end` columns before the `bind_rows()`.

library(dplyr)

have2 <- have %>%
  group_by(ID) %>%
  # Item 3: drop rows that lie entirely before start or entirely after end.
  # The outer parentheses matter: `!` binds tighter than `|`.
  filter(!((date1 < start & date2 < start) | (date1 > end & date2 > end))) %>%
  # Item 2: clamp to [start, end]; ifelse() drops the Date class, so restore it.
  mutate(date1 = ifelse(date1 < start, start, date1) %>% as.Date(origin = "1970-01-01"),
         date2 = ifelse(date2 > end, end, date2) %>% as.Date(origin = "1970-01-01"))
# You don't necessarily need to ungroup, as we want grouped data for the summarize

have2 %>%
  # In dplyr >= 1.1.0, reframe() is the preferred verb for multi-row results
  summarize(newdate1 = c(start[1], date2 + 1),
            newdate2 = c(date1 - 1, end[1]),
            ind = 0) %>% # Item 1
  rename(date1 = newdate1, date2 = newdate2) %>%
  filter(date2 >= date1) %>% # >= keeps one-day gaps
  ungroup() %>%
  bind_rows(select(have2, -c(start, end))) %>% # Item 4
  arrange(ID, date1)

Result:

# A tibble: 4 × 4
     ID date1      date2        ind
  <dbl> <date>     <date>     <dbl>
1     1 2000-01-01 2000-01-15     1
2     1 2000-01-16 2000-01-28     0
3     1 2000-01-29 2001-04-30     1
4     1 2001-05-01 2001-07-31     0
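Since the floor and ceiling are meant to vary by ID, it may help to see the same steps wrapped in a function and run on more than one group. This is only a sketch, not part of the original answer: the helper name `fill_gaps()` and the second ID's data are made up for illustration. It swaps `ifelse()` for `pmax()`/`pmin()` and uses `reframe()`, which assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical helper: clamp each interval to its ID's [start, end] window,
# then add the uncovered stretches as rows with ind = 0.
fill_gaps <- function(df) {
  clamped <- df %>%
    filter(date2 >= start, date1 <= end) %>%  # drop rows entirely outside the window
    mutate(date1 = pmax(date1, start),        # pmax()/pmin() keep the Date class
           date2 = pmin(date2, end)) %>%
    arrange(ID, date1)                        # gaps below rely on sorted intervals

  gaps <- clamped %>%
    group_by(ID) %>%
    reframe(newdate1 = c(start[1], date2 + 1),
            newdate2 = c(date1 - 1, end[1])) %>%
    mutate(ind = 0) %>%
    rename(date1 = newdate1, date2 = newdate2) %>%
    filter(date2 >= date1)                    # discard empty candidates

  clamped %>%
    select(-c(start, end)) %>%
    bind_rows(gaps) %>%
    arrange(ID, date1)
}

# ID 1 is the question's data; ID 2 is invented, with its own window.
have_multi <- data.frame(
  ID    = c(1, 1, 2),
  date1 = as.Date(c("1999-09-01", "2000-01-29", "2010-03-01")),
  date2 = as.Date(c("2000-01-15", "2001-04-30", "2010-06-30")),
  start = as.Date(c("2000-01-01", "2000-01-01", "2010-01-01")),
  end   = as.Date(c("2001-07-31", "2001-07-31", "2010-12-31")),
  ind   = c(1, 1, 1)
)

res <- fill_gaps(have_multi)
res
```

For ID 1 the rows match the result shown above. For ID 2, the sketch adds one gap before and one after the single observed interval. Like the original pipeline, it assumes the intervals within an ID do not overlap.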

Alternative mutate:

If you know that your data are always sorted by increasing date within each ID, you can change only the first value of date1 and the last value of date2:

mutate(date1 = c(if (date1[1] < start[1]) start[1] else date1[1], date1[-1]),
       date2 = c(date2[-n()], if (date2[n()] > end[1]) end[1] else date2[n()]))

This way you avoid unnecessary checks and don't lose the date formatting.
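The point about date formatting can be checked in base R. The sketch below uses made-up variable names (`d`, `window_start`) and compares three ways of clamping to a floor date. `pmax()` is not used in the answer; it is included as one more option that keeps the class.

```r
d <- as.Date(c("1999-09-01", "2000-01-29"))
window_start <- as.Date("2000-01-01")

# ifelse() drops attributes, so the Date class is lost and plain
# day counts since 1970-01-01 come back.
clamped_ifelse <- ifelse(d < window_start, window_start, d)
class(clamped_ifelse)  # "numeric"

# c() on Date inputs keeps the class, as in the alternative mutate.
clamped_c <- c(if (d[1] < window_start) window_start else d[1], d[-1])
class(clamped_c)       # "Date"

# pmax() is vectorised and also keeps the class.
clamped_pmax <- pmax(d, window_start)
class(clamped_pmax)    # "Date"
```

This is why the `ifelse()` versions need the extra `as.Date(..., origin = "1970-01-01")` step, while the `c()` version does not.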

huangapple
  • Posted on June 8, 2023 at 06:10:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76427407.html