2023-06-08 06:10:35

Complete missing, non-overlapping date ranges, truncated, variable start/end dates, indicator var

Question

My question is a variation on https://stackoverflow.com/questions/76015556/complete-missing-dates-based-on-start-and-end and https://stackoverflow.com/questions/76393814/complete-missing-non-overlapping-date-ranges. Here is my starting data:

have <- data.frame(ID = c(1, 1),
                 date1 = as.Date(c("1999-09-01", "2000-01-29")),
                 date2 = as.Date(c("2000-01-15", "2001-04-30")),
                 start = as.Date(c("2000-01-01", "2000-01-01")),
                 end = as.Date(c("2001-07-31", "2001-07-31")),
                 ind = c(1,1))
> have
  ID      date1      date2      start        end ind
1  1 1999-09-01 2000-01-15 2000-01-01 2001-07-31   1
2  1 2000-01-29 2001-04-30 2000-01-01 2001-07-31   1

My desired output looks like this:

want <- data.frame(ID = c(1, 1, 1, 1),
                 date1 = as.Date(c("2000-01-01", "2000-01-16", "2000-01-29", "2001-05-01")),
                 date2 = as.Date(c("2000-01-15", "2000-01-28", "2001-04-30", "2001-07-31")),
                 ind = c(1,0,1,0))
> want
  ID      date1      date2 ind
1  1 2000-01-01 2000-01-15   1
2  1 2000-01-16 2000-01-28   0
3  1 2000-01-29 2001-04-30   1
4  1 2001-05-01 2001-07-31   0

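What I'm trying to do is:

1. Add rows for the missing date intervals between a floor and a ceiling date (my previous post).
2. Use floor and ceiling dates that vary by group ID.
3. Exclude date ranges before the given floor date.
4. Create an indicator variable `ind` that tracks whether a date range was present in the starting data (feature observed = 1) or not (feature not observed = 0).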

So far I have this code from the linked posts, but the output is not quite what I'm going for:

library(dplyr)
ex <- have %>%
  group_by(ID) %>%
  summarize(newdate1 = c(start, date2+1), newdate2 = c(date1-1, end)) %>%
  rename(date1 = newdate1, date2 = newdate2) %>%
  filter(date2 > date1) %>%
  ungroup() %>%
  bind_rows(have) %>%
  arrange(ID, date1)
> ex
# A tibble: 5 × 6
     ID date1      date2      start      end          ind
  <dbl> <date>     <date>     <date>     <date>     <dbl>
1     1 1999-09-01 2000-01-15 2000-01-01 2001-07-31     1
2     1 2000-01-01 2000-01-28 NA         NA            NA
3     1 2000-01-16 2001-07-31 NA         NA            NA
4     1 2000-01-29 2001-04-30 2000-01-01 2001-07-31     1
5     1 2001-05-01 2001-07-31 NA         NA            NA

Thank you.

**Update:**
Recoding the intervals so they fall within the floor and ceiling dates is simple enough:

have2 <- have %>%
  group_by(ID) %>%
  mutate(date1new = ifelse(date1 < start & date2 > start, start, date1),
         date2new = ifelse(date2 > end & date1 < end, end, date2),
         date1new = as.Date(date1new, origin = "1970-01-01"),
         date2new = as.Date(date2new, origin = "1970-01-01")) %>%
  select(ID, (start:date2new)) %>%
  rename(date1 = date1new, date2 = date2new) %>%
  ungroup()
> have2
# A tibble: 2 × 6
     ID start      end          ind date1      date2     
  <dbl> <date>     <date>     <dbl> <date>     <date>    
1     1 2000-01-01 2001-07-31     1 2000-01-01 2000-01-15
2     1 2000-01-01 2001-07-31     1 2000-01-29 2001-04-30

I still can't get my desired output using the example script.

ex2 <- have2 %>%
  group_by(ID) %>%
  summarize(newdate1 = c(start, date2+1), newdate2 = c(date1-1, end)) %>%
  rename(date1 = newdate1, date2 = newdate2) %>%
  filter(date2 > date1) %>%
  ungroup() %>%
  bind_rows(have2) %>%
  arrange(ID, date1)
> ex2
# A tibble: 5 × 6
     ID date1      date2      start      end          ind
  <dbl> <date>     <date>     <date>     <date>     <dbl>
1     1 2000-01-01 2000-01-28 NA         NA            NA
2     1 2000-01-01 2000-01-15 2000-01-01 2001-07-31     1
3     1 2000-01-16 2001-07-31 NA         NA            NA
4     1 2000-01-29 2001-04-30 2000-01-01 2001-07-31     1
5     1 2001-05-01 2001-07-31 NA         NA            NA

# Answer 1
**Score**: 0

You almost got it. The main mistake was passing `start` and `end` as whole columns rather than as single values in

newdate1 = c(start, date2+1), newdate2 = c(date1-1, end)

Within a group, `start` has one element per row, so `c(start, date2+1)` prepends the entire `start` column rather than a single floor date. That is why your output had extra rows. What you want is `start[1]` (and likewise `end[1]`).
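The length mismatch can be seen in isolation with a minimal base R sketch. The two vectors below are hypothetical stand-ins for the `start` and `date2` columns of one ID group after clipping; they are not taken from a real pipeline run.

```r
# Hypothetical stand-ins for the `start` and `date2` columns of a single ID group.
start <- as.Date(c("2000-01-01", "2000-01-01"))
date2 <- as.Date(c("2000-01-15", "2001-04-30"))

# Passing the whole column prepends one floor date per row of the group.
whole_column <- c(start, date2 + 1)

# Passing a single value prepends exactly one floor date.
first_value <- c(start[1], date2 + 1)

length(whole_column)  # 4
length(first_value)   # 3
```

With two observed intervals there are only three candidate gaps (before the first, between the two, and after the last), so three elements is the length the later steps expect.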

Besides that, we can make some minor improvements:

1. Create an `ind = 0` column in the `summarize`;
2. Make your `mutate` more compact;
3. Use a `filter` to entirely remove rows where both dates fall before `start` or after `end`;
4. Deselect the `start` and `end` columns before the `bind_rows()`.

library(dplyr)

have2 <- have %>%
  group_by(ID) %>%
  # Item 3: drop intervals lying entirely before `start` or entirely after `end`.
  # The outer parentheses matter: `!` must negate the whole condition.
  filter(!((date1 < start & date2 < start) | (date1 > end & date2 > end))) %>%
  # Item 2: clip to [start, end]; ifelse() drops the Date class, so restore it
  mutate(date1 = ifelse(date1 < start, start, date1) %>% as.Date(origin = "1970-01-01"),
         date2 = ifelse(date2 > end, end, date2) %>% as.Date(origin = "1970-01-01"))
# You don't necessarily need to ungroup, as we want grouped data for the summarize
have2 %>%
  summarize(newdate1 = c(start[1], date2 + 1),
            newdate2 = c(date1 - 1, end[1]),
            ind = 0) %>% # Item 1
  rename(date1 = newdate1, date2 = newdate2) %>%
  filter(date2 > date1) %>%
  ungroup() %>%
  bind_rows(select(have2, -c(start, end))) %>% # Item 4
  arrange(ID, date1)

Result:

# A tibble: 4 × 4
     ID date1      date2        ind
  <dbl> <date>     <date>     <dbl>
1     1 2000-01-01 2000-01-15     1
2     1 2000-01-16 2000-01-28     0
3     1 2000-01-29 2001-04-30     1
4     1 2001-05-01 2001-07-31     0

Alternative mutate:

If you know that your data is always sorted by increasing date within each ID, you can change just the first value of `date1` and the last value of `date2`:

mutate(date1 = c(if (date1[1] < start[1]) start[1] else date1[1], date1[-1]),
       date2 = c(date2[-n()], if (date2[n()] > end[1]) end[1] else date2[n()]))

This way you avoid unnecessary checks and don't lose the date formatting.
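For comparison, the same steps can be sketched in base R for a single ID. This is only an illustration under assumptions, not the answer's method: `fill_gaps` is a hypothetical helper name, it assumes the input intervals do not overlap each other, and it uses `>=` in the gap filter so that one-day gaps are kept (the `dplyr` code above uses `>`).

```r
# Hypothetical helper (not part of the original answer): clip one ID's intervals
# to [start, end] and add the uncovered gaps with ind = 0.
fill_gaps <- function(d) {
  id           <- d$ID[1]
  floor_date   <- d$start[1]
  ceiling_date <- d$end[1]

  # Keep only intervals that overlap the window, sorted by their first date.
  d <- d[d$date2 >= floor_date & d$date1 <= ceiling_date, ]
  d <- d[order(d$date1), ]

  # Clip to the window; pmax()/pmin() keep the Date class, unlike ifelse().
  obs <- data.frame(ID    = d$ID,
                    date1 = pmax(d$date1, floor_date),
                    date2 = pmin(d$date2, ceiling_date),
                    ind   = rep(1, nrow(d)))

  # Candidate gaps: before the first interval, between neighbours, after the last.
  gaps <- data.frame(ID    = id,
                     date1 = c(floor_date, obs$date2 + 1),
                     date2 = c(obs$date1 - 1, ceiling_date),
                     ind   = 0)
  gaps <- gaps[gaps$date2 >= gaps$date1, ]  # drop empty gaps, keep one-day gaps

  out <- rbind(obs, gaps)
  out <- out[order(out$date1), ]
  rownames(out) <- NULL
  out
}

# The starting data from the question.
have <- data.frame(ID    = c(1, 1),
                   date1 = as.Date(c("1999-09-01", "2000-01-29")),
                   date2 = as.Date(c("2000-01-15", "2001-04-30")),
                   start = as.Date(c("2000-01-01", "2000-01-01")),
                   end   = as.Date(c("2001-07-31", "2001-07-31")),
                   ind   = c(1, 1))

result <- fill_gaps(have)
result
```

On the question's data this should reproduce the four rows of `want`. For several IDs, one option is `do.call(rbind, lapply(split(have, have$ID), fill_gaps))`, which applies the helper to each group and stacks the results.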
