英文:
How to merge several rows with event data from multiple columns when based on unique character column?
问题
假设我有一个名为d
的数据框,其中包含以下内容:
> d
pnr age.hl age.hl.time kon.hl kon.hl.time sen.hl sen.hl.time mix.hl mix.hl.time
1 A 0 2018-12-31 1 2011-02-01 0 2018-12-31 0 2018-12-31
2 A 1 2013-10-31 0 2018-12-31 0 2018-12-31 1 2013-10-31
3 A 0 2018-12-31 0 2018-12-31 1 2016-09-30 0 2018-12-31
4 B 0 2018-12-31 0 2018-12-31 1 2004-04-30 0 2018-12-31
5 C 0 2018-12-31 0 2018-12-31 0 2018-12-31 1 2006-01-17
6 C 1 2008-07-31 0 2018-12-31 1 2009-01-31 0 2018-12-31
其中,d$pnr
是唯一的患者标识符,age.hl
、kon.hl
、sen.hl
和mix.hl
表示不同的疾病及其对应的时间变量。如果在任何一列中出现1,则不能再在同一列中再次出现1。
我需要合并行,使得d$pnr
只出现一次。也就是说,每列的事件数据/信息应合并到同一行中。
我在dplyr
中寻找解决方案。
期望的输出:
pnr age.hl age.hl.time kon.hl kon.hl.time sen.hl sen.hl.time mix.hl mix.hl.time
1 A 1 2013-10-31 1 2011-02-01 1 2016-09-30 1 2013-10-31
2 B 0 2018-12-31 0 2018-12-31 1 2004-04-30 0 2018-12-31
3 C 1 2008-07-31 0 2018-12-31 1 2009-01-31 1 2006-01-17
数据:
d <- data.frame(
pnr = c("A", "A", "A", "B", "C", "C"),
age.hl = c(0, 1, 0, 0, 0, 1),
age.hl.time = c(as.Date("2018-12-31"), as.Date("2013-10-31"), as.Date("2018-12-31"),
as.Date("2018-12-31"), as.Date("2018-12-31"), as.Date("2008-07-31")),
kon.hl = c(1, 0, 0, 0, 0, 0),
kon.hl.time = c(as.Date("2011-02-01"), as.Date("2018-12-31"), as.Date("2018-12-31"),
as.Date("2018-12-31"), as.Date("2018-12-31"), as.Date("2018-12-31")),
sen.hl = c(0, 0, 1, 1, 0, 1),
sen.hl.time = c(as.Date("2018-12-31"), as.Date("2018-12-31"), as.Date("2016-09-30"),
as.Date("2004-04-30"), as.Date("2018-12-31"), as.Date("2009-01-31")),
mix.hl = c(0, 1, 0, 0, 1, 0),
mix.hl.time = c(as.Date("2018-12-31"), as.Date("2013-10-31"), as.Date("2018-12-31"),
as.Date("2018-12-31"), as.Date("2006-01-17"), as.Date("2018-12-31"))
)
英文:
Let's say I have d
:
> d
pnr age.hl age.hl.time kon.hl kon.hl.time sen.hl sen.hl.time mix.hl mix.hl.time
1 A 0 2018-12-31 1 2011-02-01 0 2018-12-31 0 2018-12-31
2 A 1 2013-10-31 0 2018-12-31 0 2018-12-31 1 2013-10-31
3 A 0 2018-12-31 0 2018-12-31 1 2016-09-30 0 2018-12-31
4 B 0 2018-12-31 0 2018-12-31 1 2004-04-30 0 2018-12-31
5 C 0 2018-12-31 0 2018-12-31 0 2018-12-31 1 2006-01-17
6 C 1 2008-07-31 0 2018-12-31 1 2009-01-31 0 2018-12-31
d$pnr
is a unique patient identifier. age.hl
, kon.hl
, sen.hl
and mix.hl
denote different diseases and their corresponding time variable. All patients are censored at 2018-12-31
if 0 event occured.
If 1 occur in any column, it cannot occur again in the same column.
I need to merge rows, so d$pnr
only occur one time. I.e, event data/information from each column shall be merged to the same row.
I look for a solution in dplyr
.
Expected output:
pnr age.hl age.hl.time kon.hl kon.hl.time sen.hl sen.hl.time mix.hl mix.hl.time
1 A 1 2013-10-31 1 2011-02-01 1 2016-09-30 1 2013-10-31
2 B 0 2018-12-31 0 2018-12-31 1 2004-04-30 0 2018-12-31
3 C 1 2008-07-31 0 2018-12-31 1 2009-01-31 1 2006-01-17
Data
d <- data.frame(
pnr = c("A", "A", "A", "B", "C", "C"),
age.hl = c(0, 1, 0, 0, 0, 1),
age.hl.time = c(as.Date("2018-12-31"), as.Date("2013-10-31"), as.Date("2018-12-31"),
as.Date("2018-12-31"), as.Date("2018-12-31"), as.Date("2008-07-31")),
kon.hl = c(1, 0, 0, 0, 0, 0),
kon.hl.time = c(as.Date("2011-02-01"), as.Date("2018-12-31"), as.Date("2018-12-31"),
as.Date("2018-12-31"), as.Date("2018-12-31"), as.Date("2018-12-31")),
sen.hl = c(0, 0, 1, 1, 0, 1),
sen.hl.time = c(as.Date("2018-12-31"), as.Date("2018-12-31"), as.Date("2016-09-30"),
as.Date("2004-04-30"), as.Date("2018-12-31"), as.Date("2009-01-31")),
mix.hl = c(0, 1, 0, 0, 1, 0),
mix.hl.time = c(as.Date("2018-12-31"), as.Date("2013-10-31"), as.Date("2018-12-31"),
as.Date("2018-12-31"), as.Date("2006-01-17"), as.Date("2018-12-31"))
)
答案1
得分: 2
你可以使用which.max()
函数:
library(dplyr)
d %>%
summarise(across(ends_with("time"), ~ .x[which.max(get(sub(".time", "", cur_column())))]),
across(ends_with("hl"), max),
.by = pnr) %>%
select(names(d))
# pnr age.hl age.hl.time kon.hl kon.hl.time sen.hl sen.hl.time mix.hl mix.hl.time
# 1 A 1 2013-10-31 1 2011-02-01 1 2016-09-30 1 2013-10-31
# 2 B 0 2018-12-31 0 2018-12-31 1 2004-04-30 0 2018-12-31
# 3 C 1 2008-07-31 0 2018-12-31 1 2009-01-31 1 2006-01-17
注意:第一个across()
函数中的值依赖于第二个across()
函数中的值,所以这两个across()
函数不能互换位置。
英文:
You can use which.max()
:
library(dplyr)
d %>%
summarise(across(ends_with("time"), ~ .x[which.max(get(sub(".time", "", cur_column())))]),
across(ends_with("hl"), max),
.by = pnr) %>%
select(names(d))
# pnr age.hl age.hl.time kon.hl kon.hl.time sen.hl sen.hl.time mix.hl mix.hl.time
# 1 A 1 2013-10-31 1 2011-02-01 1 2016-09-30 1 2013-10-31
# 2 B 0 2018-12-31 0 2018-12-31 1 2004-04-30 0 2018-12-31
# 3 C 1 2008-07-31 0 2018-12-31 1 2009-01-31 1 2006-01-17
Note: The function in the first across()
involves the values in the second across()
, so the two across()
are not exchangeable.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论