Group and merge rows by ID when there are identical start and end date fields in R columns
Question
I have a dataset with ID, Start_date and End_date fields. Some records in this dataset have an end date that matches the start date for the same ID, but in a different row. For instance, an ID could have dates 2002-01-20 to 2002-01-22 and another record on a different row with 2002-01-22 to 2002-01-23. I wish to merge those records together so I have only one record for that ID that has the start date from the first record (2002-01-20) and the end date from the second record (2002-01-23).
Example original data:
structure(list(ID = c(565, 898, 521, 522, 522, 323, 887, 887,
522), Start_date = structure(c(12846, 19172, 19341, 19495, 19497,
19495, 14194, 14204, 18786), class = "Date"), End_date = structure(c(12847,
19174, 19347, 19497, 19499, 19497, 14203, 14206, 18798), class = "Date")), row.names = c(NA,
-9L), class = c("tbl_df", "tbl", "data.frame"))
Example final data:
structure(list(ID = c(565, 898, 521, 522, 323, 887, 887, 522),
Start_date = structure(c(12846, 19172, 19341, 19495, 19495,
14194, 14204, 18786), class = "Date"), End_date = structure(c(12847,
19174, 19347, 19499, 19497, 14203, 14206, 18798), class = "Date")), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
There seem to be a number of SQL solutions on Stack Overflow for this problem, but I wasn't able to locate one using R. Thank you.
Answer 1
Score: 0
First, we can use an inner_join to join the dataset to itself and keep only the rows with the same ID and matching Start/End dates:
library(tidyverse)
dat <- structure(list(ID = c(565, 898, 521, 522, 522, 323, 887, 887,
522), Start_date = structure(c(12846, 19172, 19341, 19495, 19497,
19495, 14194, 14204, 18786), class = "Date"), End_date = structure(c(12847,
19174, 19347, 19497, 19499, 19497, 14203, 14206, 18798), class = "Date")), row.names = c(NA,
-9L), class = c("tbl_df", "tbl", "data.frame"))
processed <- structure(list(ID = c(565, 898, 521, 522, 323, 887, 887, 522),
Start_date = structure(c(12846, 19172, 19341, 19495, 19495,
14194, 14204, 18786), class = "Date"), End_date = structure(c(12847,
19174, 19347, 19499, 19497, 14203, 14206, 18798), class = "Date")), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
dat |>
  inner_join(dat,
             by = c("ID", "End_date" = "Start_date"))
#> # A tibble: 1 × 4
#> ID Start_date End_date End_date.y
#> <dbl> <date> <date> <date>
#> 1 522 2023-05-18 2023-05-20 2023-05-22
We rename the columns to keep only the correct End_date:
dat |>
  inner_join(dat,
             by = c("ID", "End_date" = "Start_date")) |>
  select(ID, Start_date, End_date = End_date.y)
Then we do the same operation with an anti_join to remove these rows. Because we want to remove both rows (the one whose End_date matched and the one whose Start_date matched), we need to do it twice:
dat |>
  anti_join(dat,
            by = c("ID", "End_date" = "Start_date")) |>
  anti_join(dat,
            by = c("ID", "Start_date" = "End_date"))
#> # A tibble: 7 × 3
#> ID Start_date End_date
#> <dbl> <date> <date>
#> 1 565 2005-03-04 2005-03-05
#> 2 898 2022-06-29 2022-07-01
#> 3 521 2022-12-15 2022-12-21
#> 4 323 2023-05-18 2023-05-20
#> 5 887 2008-11-11 2008-11-20
#> 6 887 2008-11-21 2008-11-23
#> 7 522 2021-06-08 2021-06-20
And we're done; we can just combine these two results:
combined <- dat |>
  inner_join(dat,
             by = c("ID", "End_date" = "Start_date")) |>
  select(ID, Start_date, End_date = End_date.y)
res <- dat |>
  anti_join(dat,
            by = c("ID", "End_date" = "Start_date")) |>
  anti_join(dat,
            by = c("ID", "Start_date" = "End_date")) |>
  bind_rows(combined)
all.equal(arrange(res, ID, Start_date),
          arrange(processed, ID, Start_date))
#> [1] TRUE
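One caveat worth illustrating, as a sketch on made-up data rather than part of the original answer: the self-join pairs each row only with its direct successor, so a chain of three touching records appears to come out as two overlapping rows instead of one merged row. The tibble `chain` below is hypothetical.

```r
library(dplyr)

# Hypothetical chain: three records for one ID, each starting where the previous ends
chain <- tibble(
  ID = 1,
  Start_date = as.Date(c("2002-01-20", "2002-01-22", "2002-01-23")),
  End_date   = as.Date(c("2002-01-22", "2002-01-23", "2002-01-25"))
)

# Same self-join as in the answer above
combined <- chain |>
  inner_join(chain,
             by = c("ID", "End_date" = "Start_date")) |>
  select(ID, Start_date, End_date = End_date.y)

# Two partially merged rows remain instead of a single 2002-01-20 to 2002-01-25 record
print(combined)
```

If chains longer than two records can occur in the data, a run-based grouping per ID is probably a safer starting point than a single self-join.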
Answer 2
Score: 0
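library(dplyr)

# df is the example data from the question
df |>
  # as.Date() on a bare number needs R >= 4.3; on older R add origin = "1970-01-01"
  mutate(End_date = as.Date(ifelse(End_date %in% Start_date, max(End_date), End_date)),
         .by = ID) |>
  distinct(ID, End_date, .keep_all = TRUE)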
How it works
- Within an ID (via .by = ID), we find rows where the End_date is also the Start_date in another row.
- When that condition is met, then within that ID and for that row we set End_date to the max date, i.e. the most recent date in chronological order. After this operation, End_date is duplicated in two rows for a given ID.
- Lastly, we can simply remove the duplicate row using distinct, which keeps the first row for each unique combination of ID and End_date.
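To make the three steps concrete, here is a minimal sketch that runs them one at a time on a hypothetical two-row tibble (`toy`, `is_linked`, and the `step*` names are illustrative and not from the answer). It uses `if_else()` instead of `as.Date(ifelse(...))` so the Date class is kept without a conversion; that substitution is an assumption of this sketch.

```r
library(dplyr)

# Hypothetical example: two touching records for one ID
toy <- tibble(
  ID = 522,
  Start_date = as.Date(c("2023-05-18", "2023-05-20")),
  End_date   = as.Date(c("2023-05-20", "2023-05-22"))
)

# Step 1: flag rows whose End_date is also a Start_date within the same ID
step1 <- toy |>
  mutate(is_linked = End_date %in% Start_date, .by = ID)

# Step 2: for flagged rows, overwrite End_date with the latest End_date of the ID
step2 <- step1 |>
  mutate(End_date = if_else(is_linked, max(End_date), End_date), .by = ID)

# Step 3: both rows now share an End_date, so distinct() keeps only the first
step3 <- step2 |>
  distinct(ID, End_date, .keep_all = TRUE) |>
  select(-is_linked)

print(step3)
```

Only the first row is flagged in step 1, which is why the surviving row keeps the earlier Start_date.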
Note: .by is experimental and new as of dplyr 1.1.0 (check your version with packageVersion("dplyr")). If you are using an older version of dplyr, you would do something like:
df |>
  group_by(ID) |>
  mutate(End_date = as.Date(ifelse(End_date %in% Start_date, max(End_date), End_date))) |>
  ungroup() |>
  distinct(ID, End_date, .keep_all = TRUE)
Output
ID Start_date End_date
<dbl> <date> <date>
1 565 2005-03-04 2005-03-05
2 898 2022-06-29 2022-07-01
3 521 2022-12-15 2022-12-21
4 522 2023-05-18 2023-05-22
5 323 2023-05-18 2023-05-20
6 887 2008-11-11 2008-11-20
7 887 2008-11-21 2008-11-23
8 522 2021-06-08 2021-06-20
One caveat: the approach breaks when an ID has a matching Start_date and End_date but also an additional non-matching row with a more recent date (the third row below), because max(End_date) then picks up that later row's End_date:
df2 <- data.frame(ID = '522',
                  Start_date = as.Date(c("2023-05-18", "2023-05-20", "2023-05-25")),
                  End_date = as.Date(c("2023-05-20", "2023-05-22", "2023-05-28")))
You could then do something like:
df2 |>
  # Take the End_date of the row whose Start_date matches, rather than the group max
  mutate(End_date = as.Date(ifelse(End_date %in% Start_date, End_date[which(Start_date %in% End_date)], End_date)),
         .by = ID) |>
  distinct(ID, End_date, .keep_all = TRUE)
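The sketch below contrasts the group-max version with one possible alternative on the same three-row example. The alternative is not the answer's code: it looks up the continuing row with `match()` and drops absorbed rows with `filter()`, and the names `naive`, `fixed`, `absorbed`, and `next_end` are made up for illustration. Like the answer, it assumes pairs only, not longer chains.

```r
library(dplyr)

df2 <- data.frame(ID = "522",
                  Start_date = as.Date(c("2023-05-18", "2023-05-20", "2023-05-25")),
                  End_date   = as.Date(c("2023-05-20", "2023-05-22", "2023-05-28")))

# Group-max version: the first row wrongly inherits the unrelated third row's End_date
naive <- df2 |>
  mutate(End_date = if_else(End_date %in% Start_date, max(End_date), End_date),
         .by = ID) |>
  distinct(ID, End_date, .keep_all = TRUE)

# Alternative sketch: take the End_date of the specific row that continues this one
fixed <- df2 |>
  mutate(
    absorbed = Start_date %in% End_date,               # row continues an earlier record
    next_end = End_date[match(End_date, Start_date)],  # End_date of the continuing row, NA if none
    End_date = coalesce(next_end, End_date),
    .by = ID
  ) |>
  filter(!absorbed) |>
  select(-absorbed, -next_end)

print(naive)
print(fixed)
```

Both helper columns are computed before End_date is overwritten, since mutate() evaluates its arguments in order.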
Answer 3
Score: 0
I ended up solving this with the following code, which accounts for a few nuances in my data: (1) I wanted to retain additional variables from my original dataset associated with the records; (2) there may be 3 or 4 records that need to be linked together; and (3) IDs can and do repeat in my dataset, and some of those records should be merged while others should remain separate because they don't meet my criterion (same end/start dates).
library(data.table)  # for setDT()
library(dplyr)

raw_data <- raw_data %>%
  arrange(ID, START_DATE)
# Flag rows whose START_DATE equals the previous END_DATE within the same ID
raw_data <- raw_data %>%
  group_by(ID) %>%
  mutate(Combined = !is.na(lag(END_DATE)) & START_DATE == lag(END_DATE)) %>%
  ungroup()
# Each non-combined row starts a new Group; groups with more than one row need merging
multiples <- raw_data %>%
  mutate(Group = cumsum(!Combined)) %>%
  group_by(Group) %>%
  filter(n() > 1) %>%
  ungroup()
singles <- raw_data %>%
  mutate(Group = cumsum(!Combined)) %>%
  group_by(Group) %>%
  filter(n() == 1) %>%
  ungroup() %>%
  select(-Combined)
multiples <- multiples %>%
  group_by(ID, Group) %>%
  summarize(START_DATE = first(START_DATE),
            END_DATE = last(END_DATE)) %>%
  ungroup()
raw_data <- raw_data %>%
  select(-START_DATE, -END_DATE, -Combined)
# This step is so I can reintroduce the additional variables of interest back into my dataset, after I combined the records I wanted combined.
multiples <- multiples %>%
  left_join(raw_data, by = "ID")
# This deduplicates my records after the left join, leaving only a single record for each
multiples <- setDT(multiples)[order(ID, Group), .SD[1], by = list(ID, Group)]
# Match columns by name, since the two tables may order them differently
final_dataset <- rbind(multiples, singles, use.names = TRUE)
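For comparison, the same run-based idea can be sketched more compactly on the question's example data, using the question's column names (Start_date, End_date) and without the extra variables this answer carries along. The names `continues`, `spell`, and `merged` are illustrative, and the sketch assumes that a continuing record sorts directly after the record it continues within its ID.

```r
library(dplyr)

# Example data from the question, rebuilt from the day counts in its dput() output
dat <- tibble(
  ID = c(565, 898, 521, 522, 522, 323, 887, 887, 522),
  Start_date = as.Date(c(12846, 19172, 19341, 19495, 19497, 19495, 14194, 14204, 18786),
                       origin = "1970-01-01"),
  End_date = as.Date(c(12847, 19174, 19347, 19497, 19499, 19497, 14203, 14206, 18798),
                     origin = "1970-01-01")
)

merged <- dat |>
  arrange(ID, Start_date) |>
  group_by(ID) |>
  mutate(
    # TRUE when this row starts exactly where the previous row of the same ID ended
    continues = coalesce(Start_date == lag(End_date), FALSE),
    # Every non-continuing row opens a new run ("spell") within the ID
    spell = cumsum(!continues)
  ) |>
  group_by(ID, spell) |>
  summarise(Start_date = first(Start_date),
            End_date = last(End_date),
            .groups = "drop") |>
  select(-spell)

print(merged)
```

Because each run is collapsed to its first Start_date and last End_date, chains of three or four touching records are merged in the same pass.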