tmerge() + coxph(): two ways of setting up dates should give same results, and don't
Question
When using tmerge() to create data for Cox regression with time-varying covariates, two ways of expressing times should (I think) give the same regression results, but they don't.
One way uses start and end dates and converts them to numeric within Surv(); the other uses numeric days to event.
Example
First, create some data. We have an ID, an outcome (death), a start date for each row, and an end date some time later. The start date and end date are Date objects.
library(survival)  # provides tmerge(), Surv(), and coxph()

n <- 1000
set.seed(0)
dd <- data.frame(id=1:n,
                 death=sample(x=c(FALSE, TRUE), prob=c(8, 1), size=n, replace=TRUE),
                 startDate=as.Date(runif(min=as.numeric(as.Date("2000-01-01")),
                                         max=as.numeric(as.Date("2019-12-31")), n=n),
                                   origin="1970-01-01"))
dd$endDate <- as.Date(as.numeric(dd$startDate) +
                      rnorm(mean=3650, sd=500, n=n), origin="1970-01-01")
# (You can check that endDate is never before startDate.)
Rather than a start and end date for each participant, we could alternatively start each person's time at zero and have a numeric number of days until event/censor:
dd$startDay <- 0
dd$endDay <- as.numeric(dd$endDate - dd$startDate)
Next, we use tmerge() to transform the data into the format needed for Cox regression with time-varying covariates. (Note: this is a minimal example that does not actually have any time-varying covariates.)
We do this two ways, to compare. 1) Using numeric days to event/censor; 2) Using dates.
Using days
ddTv <- tmerge(data1=dd, data2=dd, id=id,
               tstart=startDay, tstop=endDay, event=event(endDay, death))
ddTv[1:6, ]
id death startDate endDate startDay endDay tstart tstop event
1 TRUE 2005-04-23 2014-11-29 0 3506.6 0 3506.6 TRUE
2 FALSE 2010-08-13 2023-02-16 0 4570.6 0 4570.6 FALSE
3 FALSE 2013-09-11 2023-06-22 0 3571.6 0 3571.6 FALSE
4 FALSE 2007-08-31 2015-10-03 0 2955.1 0 2955.1 FALSE
5 TRUE 2019-02-05 2027-01-27 0 2913.4 0 2913.4 TRUE
6 FALSE 2002-05-14 2012-04-06 0 3615.2 0 3615.2 FALSE
Using dates
ddTvDate <- tmerge(data1=dd, data2=dd, id=id,
                   tstart=startDate, tstop=endDate, event=event(endDate, death))
ddTvDate[1:6, ]
id death startDate endDate startDay endDay tstart tstop event
1 TRUE 2005-04-23 2014-11-29 0 3506.6 2005-04-23 2014-11-29 TRUE
2 FALSE 2010-08-13 2023-02-16 0 4570.6 2010-08-13 2023-02-16 FALSE
3 FALSE 2013-09-11 2023-06-22 0 3571.6 2013-09-11 2023-06-22 FALSE
4 FALSE 2007-08-31 2015-10-03 0 2955.1 2007-08-31 2015-10-03 FALSE
5 TRUE 2019-02-05 2027-01-27 0 2913.4 2019-02-05 2027-01-27 TRUE
6 FALSE 2002-05-14 2012-04-06 0 3615.2 2002-05-14 2012-04-06 FALSE
Finally, these two ways of expressing the same data don't give the same regression results. We'll compare just the null model:
Using days
ddMod <- coxph(formula=Surv(time=tstart, time2=tstop, event=death) ~ 1,
               data=ddTv)
ddMod
Null model
log likelihood= -702.08
n= 1000
Using dates
ddModDate <- coxph(formula=Surv(time=as.numeric(tstart), time2=as.numeric(tstop), event=death) ~ 1,
                   data=ddTvDate)
ddModDate
Null model
log likelihood= -681.85
n= 1000
Log-likelihoods are similar, but not the same.
Why are these not the same?
If you add covariates to the model then coefficients and p values between the two versions are again not the same.
Finally, if you don't use tmerge() and go straight to coxph() on the original dataset, then both methods give the same results. Both of these models
ddMod2 <- coxph(formula=Surv(time=endDay, event=death) ~ 1,
                data=dd)
ddMod2
ddModDate2 <- coxph(formula=Surv(time=as.numeric(endDate - startDate), event=death) ~ 1,
                    data=dd)
ddModDate2
give the same results as ddMod above, the version using days.
Answer 1
Score: 0
Professor Terry M. Therneau (creator of the survival package) kindly gave me an answer, with permission to post here.
Paraphrasing:
Basically, the results are different because those are two entirely different models. Consider, for example, a participant who had an event on the 100th day that they were in the study, on January 1, 2010.
- If I use calendar dates for my times, then the risk set for that event is everyone who was in the study on January 1, 2010.
- If I use time since entry for my times, then the risk set for that event is everyone who was still in the study on their 100th day since entry.
Those are probably very different sets of people!
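To make the difference concrete, here is a rough sketch in base R that counts who is at risk at one participant's event under each time scale. It regenerates data along the same lines as the question; the names entry, exit, followUp, atRiskCalendar and atRiskEntry are illustrative and not taken from the original post.

```r
set.seed(0)
n <- 1000
death <- sample(c(FALSE, TRUE), size = n, replace = TRUE, prob = c(8, 1))
# Entry and exit as numeric days since 1970-01-01 (hypothetical names)
entry <- runif(n, min = as.numeric(as.Date("2000-01-01")),
               max = as.numeric(as.Date("2019-12-31")))
exit <- entry + rnorm(n, mean = 3650, sd = 500)
followUp <- exit - entry

i <- which(death)[1]  # first participant who has an event

# Calendar scale: entered before the event date and still followed on it
atRiskCalendar <- entry < exit[i] & exit >= exit[i]
# Time since entry: followed for at least as long as participant i
atRiskEntry <- followUp >= followUp[i]

c(calendar = sum(atRiskCalendar), sinceEntry = sum(atRiskEntry))
```

The two logical vectors select different participants, so the partial likelihood contribution of that single event already differs between the two time scales.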
For almost every study, time since entry is the measure you want.
Obvious once he points it out, opaque to me until then.
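As a check on this explanation, the sketch below fits the null model on both scales without going through tmerge(). It assumes one row per participant, as in the question's minimal example, so the counting-process intervals can be built directly. The names entry, exit, fitCalendar, fitShifted and fitSimple are illustrative.

```r
library(survival)

set.seed(0)
n <- 1000
dd <- data.frame(
  id = 1:n,
  death = sample(c(FALSE, TRUE), size = n, replace = TRUE, prob = c(8, 1)),
  # Entry as numeric days since 1970-01-01 (hypothetical column name)
  entry = runif(n, min = as.numeric(as.Date("2000-01-01")),
                max = as.numeric(as.Date("2019-12-31")))
)
dd$exit <- dd$entry + rnorm(n, mean = 3650, sd = 500)

# Calendar time scale: intervals run from entry date to exit date
fitCalendar <- coxph(Surv(entry, exit, death) ~ 1, data = dd)

# Time since entry: shift every interval so that it starts at zero
fitShifted <- coxph(Surv(entry - entry, exit - entry, death) ~ 1, data = dd)

# Plain right-censored form of the same time-since-entry data
fitSimple <- coxph(Surv(exit - entry, death) ~ 1, data = dd)

c(calendar = fitCalendar$loglik,
  shifted  = fitShifted$loglik,
  simple   = fitSimple$loglik)
```

Subtracting each participant's own start from both ends of the interval puts the date-based data back on the time-since-entry scale, so the shifted and simple fits should agree while the calendar-scale fit does not.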