June 9, 2023 01:27:03

Group and merge rows by ID when there are identical start and end date fields in R columns

Question

I have a dataset with ID, Start_date and End_date fields. Some records in this dataset have an end date that matches the start date for the same ID, but in a different row. For instance, an ID could have dates 2002-01-20 to 2002-01-22 and another record on a different row with 2002-01-22 to 2002-01-23. I wish to merge those records together so I have only one record for that ID that has the start date from the first record (2002-01-20) and the end date from the second record (2002-01-23).

Example original data:

structure(list(ID = c(565, 898, 521, 522, 522, 323, 887, 887, 522),
               Start_date = structure(c(12846, 19172, 19341, 19495, 19497,
                                        19495, 14194, 14204, 18786), class = "Date"),
               End_date = structure(c(12847, 19174, 19347, 19497, 19499,
                                      19497, 14203, 14206, 18798), class = "Date")),
          row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"))

Example final data:

structure(list(ID = c(565, 898, 521, 522, 323, 887, 887, 522),
               Start_date = structure(c(12846, 19172, 19341, 19495, 19495,
                                        14194, 14204, 18786), class = "Date"),
               End_date = structure(c(12847, 19174, 19347, 19499, 19497,
                                      14203, 14206, 18798), class = "Date")),
          row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"))

There seem to be a number of SQL solutions on Stack Overflow for this problem, but I wasn't able to locate one using R. Thank you.

Answer 1

Score: 0

First, we can use an inner_join to join the dataset to itself, and keep only the rows with the same ID and matching Start/End dates:

library(tidyverse)
dat <- structure(list(ID = c(565, 898, 521, 522, 522, 323, 887, 887, 522),
                      Start_date = structure(c(12846, 19172, 19341, 19495, 19497,
                                               19495, 14194, 14204, 18786), class = "Date"),
                      End_date = structure(c(12847, 19174, 19347, 19497, 19499,
                                             19497, 14203, 14206, 18798), class = "Date")),
                 row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"))
processed <- structure(list(ID = c(565, 898, 521, 522, 323, 887, 887, 522),
                            Start_date = structure(c(12846, 19172, 19341, 19495, 19495,
                                                     14194, 14204, 18786), class = "Date"),
                            End_date = structure(c(12847, 19174, 19347, 19499, 19497,
                                                   14203, 14206, 18798), class = "Date")),
                       row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"))
dat |>
  inner_join(dat,
             by = c("ID", "End_date" = "Start_date"))
#> # A tibble: 1 × 4
#>      ID Start_date End_date   End_date.y
#>   <dbl> <date>     <date>     <date>    
#> 1   522 2023-05-18 2023-05-20 2023-05-22

We rename the columns to keep the correct End_date only:

dat |>
  inner_join(dat,
             by = c("ID", "End_date" = "Start_date")) |>
  select(ID, Start_date, End_date = End_date.y)

Then we do the same operation, with an anti_join to remove these rows. Because we want to remove both rows (the one with the Start date, and the one with the End date), we need to do it twice:

dat |>
  anti_join(dat,
            by = c("ID", "End_date" = "Start_date")) |>
  anti_join(dat,
            by = c("ID", "Start_date" = "End_date"))
#> # A tibble: 7 × 3
#>      ID Start_date End_date  
#>   <dbl> <date>     <date>    
#> 1   565 2005-03-04 2005-03-05
#> 2   898 2022-06-29 2022-07-01
#> 3   521 2022-12-15 2022-12-21
#> 4   323 2023-05-18 2023-05-20
#> 5   887 2008-11-11 2008-11-20
#> 6   887 2008-11-21 2008-11-23
#> 7   522 2021-06-08 2021-06-20

And we're done, we can just combine these two:

combined <- dat |>
  inner_join(dat,
             by = c("ID", "End_date" = "Start_date")) |>
  select(ID, Start_date, End_date = End_date.y)
res <- dat |>
  anti_join(dat,
            by = c("ID", "End_date" = "Start_date")) |>
  anti_join(dat,
            by = c("ID", "Start_date" = "End_date")) |>
  bind_rows(combined)
all.equal(arrange(res, ID, Start_date),
          arrange(processed, ID, Start_date))
#> [1] TRUE

Answer 2

Score: 0

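library(dplyr)
# df is the example data from the question.
# ifelse() drops the Date class, so as.Date() restores it
# (calling as.Date() on a number without `origin` needs R >= 4.3.0).
df |>
  mutate(End_date = as.Date(ifelse(End_date %in% Start_date, max(End_date), End_date)),
         .by = ID) |>
  distinct(ID, End_date, .keep_all = TRUE)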

How it works

Within each ID (i.e., .by = ID), we find rows whose End_date is also the Start_date of another row.
For those rows, we set End_date to the maximum End_date within that ID, which is the most recent date. After this step, End_date is duplicated across the two linked rows of that ID.
Lastly, we remove the extra row with distinct, which keeps the first row for each unique combination of ID and End_date.
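
The three steps above can be traced by hand on a single ID. The following is a minimal base R sketch, not the answer's own code: it uses the three rows for ID 522 from the question as plain Date vectors, and the names start, end, is_linked, new_end and keep are illustrative only.

```r
# Rows for ID 522 from the question's data, as plain Date vectors
start <- as.Date(c("2023-05-18", "2023-05-20", "2021-06-08"))
end   <- as.Date(c("2023-05-20", "2023-05-22", "2021-06-20"))

# Step 1: TRUE where a row's end date is also some row's start date
is_linked <- end %in% start

# Step 2: linked rows take the latest end date within the ID.
# Assigning by index keeps the Date class, which ifelse() would drop.
new_end <- end
new_end[is_linked] <- max(end)

# Step 3: rows 1 and 2 now share an end date; keep the first of each,
# which is what distinct(ID, End_date, .keep_all = TRUE) does per ID
keep <- !duplicated(new_end)
data.frame(Start_date = start[keep], End_date = new_end[keep])
```

The surviving rows run from 2023-05-18 to 2023-05-22 and from 2021-06-08 to 2021-06-20, which matches rows 4 and 8 of the output shown further down.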

Note: .by is experimental and was introduced in dplyr 1.1.0. If you are using an older version of dplyr, you would do something like:

df |>
  group_by(ID) |>
  mutate(...) |> # ... same code as above
  ungroup() |>
  distinct(...) # ... same code as above

Output

     ID Start_date End_date  
  <dbl> <date>     <date>    
1   565 2005-03-04 2005-03-05
2   898 2022-06-29 2022-07-01
3   521 2022-12-15 2022-12-21
4   522 2023-05-18 2023-05-22
5   323 2023-05-18 2023-05-20
6   887 2008-11-11 2008-11-20
7   887 2008-11-21 2008-11-23
8   522 2021-06-08 2021-06-20

One caveat is the scenario where an ID has a matching Start_date and End_date, plus an additional non-matching row with a more recent date (the third row):

df2 <- data.frame(ID = '522',
           Start_date = as.Date(c("2023-05-18", "2023-05-20", "2023-05-25")),
           End_date = as.Date(c("2023-05-20", "2023-05-22", "2023-05-28")))

You could then do something like:

df2 |>
  mutate(End_date = as.Date(ifelse(End_date %in% Start_date, End_date[which(Start_date %in% End_date)], End_date)),
         .by = ID) |>
  distinct(ID, End_date, .keep_all = TRUE)
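
To see why the caveat matters, the two variants can be compared by hand on the columns of df2. This is a small base R sketch rather than the answer's own code; start, end, is_linked, naive_end and fixed_end are illustrative names, and index assignment stands in for the ifelse() call.

```r
# The df2 columns as plain Date vectors
start <- as.Date(c("2023-05-18", "2023-05-20", "2023-05-25"))
end   <- as.Date(c("2023-05-20", "2023-05-22", "2023-05-28"))

# Only row 1 has an end date that is another row's start date
is_linked <- end %in% start

# First variant: linked rows take the overall latest end date,
# so row 1 wrongly extends to the end of the unrelated row 3
naive_end <- end
naive_end[is_linked] <- max(end)

# Adjusted variant: take the end date of the row whose start date
# matches an end date (row 2 here), so row 1 stops at 2023-05-22
fixed_end <- end
fixed_end[is_linked] <- end[which(start %in% end)]

format(naive_end[1])
format(fixed_end[1])
```

With the first variant, row 1 and row 3 end on the same date, so distinct would drop row 3 and keep the untouched row 2. With the adjusted variant, rows 1 and 2 share an end date and row 3 is left alone. Note that the adjusted expression is only traced here for a single linked pair per ID.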

Answer 3

Score: 0

I ended up solving this using the following code, which accounts for a few nuances in my data: (1) I wanted to retain additional variables from my original dataset associated with the records; (2) there may be 3 or 4 records that I wish to link together; and (3) IDs can and do repeat in my dataset, and I wanted some of them merged while others remain separate because they don't meet my criteria (same end/start dates).

library(dplyr)
library(data.table) # for setDT()
# Sort so that lag() compares each record with the previous one for the same ID
raw_data <- raw_data %>%
  arrange(ID, START_DATE)
# Flag records that start on the date the previous record for the same ID ended
raw_data <- raw_data %>%
  group_by(ID) %>%
  mutate(Combined = !is.na(lag(END_DATE)) & START_DATE == lag(END_DATE)) %>%
  ungroup()
# Every record that is not a continuation opens a new Group; groups with more than one record need merging
multiples <- raw_data %>%
  mutate(Group = cumsum(!Combined)) %>%
  group_by(Group) %>%
  filter(n() > 1) %>%
  ungroup()
singles <- raw_data %>%
  mutate(Group = cumsum(!Combined)) %>%
  group_by(Group) %>%
  filter(n() == 1) %>%
  ungroup() %>%
  select(-Combined)
# Collapse each group to its first start date and last end date
multiples <- multiples %>%
  group_by(ID, Group) %>%
  summarize(START_DATE = first(START_DATE),
            END_DATE = last(END_DATE)) %>%
  ungroup()
raw_data <- raw_data %>%
  select(-START_DATE, -END_DATE, -Combined)
# This step is so I can reintroduce the additional variables of interest back into my dataset, after I combined the records I wanted combined.
multiples <- multiples %>%
  left_join(raw_data, by = "ID")
# This deduplicates my records after the left join, leaving only a single record for each
multiples <- setDT(multiples)[order(ID, Group), .SD[1], by=list(ID, Group)]
final_dataset <- rbind(multiples, singles)
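
The core of this approach (flag continuations with lag, then number the runs with cumsum) can be shown in isolation. The sketch below is a condensed variant under some assumptions, not the code above: it uses the question's example data and column names (Start_date, End_date), counts runs per ID instead of globally, carries no extra variables, and assumes records within an ID only touch at their end points. The helper columns continues and run are illustrative names.

```r
library(dplyr)

# Example data from the question, rebuilt from the day counts in its dput()
dat <- tibble(
  ID         = c(565, 898, 521, 522, 522, 323, 887, 887, 522),
  Start_date = as.Date(c(12846, 19172, 19341, 19495, 19497,
                         19495, 14194, 14204, 18786), origin = "1970-01-01"),
  End_date   = as.Date(c(12847, 19174, 19347, 19497, 19499,
                         19497, 14203, 14206, 18798), origin = "1970-01-01")
)

merged <- dat %>%
  arrange(ID, Start_date) %>%
  group_by(ID) %>%
  mutate(
    # TRUE when this row starts exactly where the previous row of the ID ended
    continues = !is.na(lag(End_date)) & Start_date == lag(End_date),
    # Each row that does not continue its predecessor opens a new run
    run = cumsum(!continues)
  ) %>%
  group_by(ID, run) %>%
  # One row per run: earliest start and final end, however long the chain is
  summarise(Start_date = first(Start_date),
            End_date   = last(End_date),
            .groups = "drop") %>%
  select(-run)

merged
```

On the example data this leaves eight rows, with the two touching records for ID 522 merged into one spanning 2023-05-18 to 2023-05-22. Because a run can contain any number of consecutive rows, chains of three or four linked records collapse the same way.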
