Posted: 2023-06-13 01:59:01

tmerge() + coxph(): two ways of setting up dates should give same results, and don't

Question

When using tmerge() to create data for Cox regression with time-varying covariates, two ways of expressing times should give the same regression results (I think), but they don't.

One way uses start and end dates, and converts to numeric within Surv(); the other just uses numeric days to event.

Example

First, create some data. We have an ID, an outcome (death), a start date for each row, and an end date some time later. The start date and end date are Date objects.

library(survival)  # provides tmerge(), coxph() and Surv()

n <- 1000
set.seed(0)
dd <- data.frame(id=1:n, 
  death=sample(x=c(FALSE, TRUE), prob=c(8, 1), size=n, replace=TRUE), 
  startDate=as.Date(runif(min=as.numeric(as.Date("2000-01-01")), 
    max=as.numeric(as.Date("2019-12-31")), n=n), origin="1970-01-01"))
dd$endDate <- as.Date(as.numeric(dd$startDate) + 
    rnorm(mean=3650, sd=500, n=n), origin="1970-01-01")
# (You can check that endDate is never before startDate.)

Rather than a start and end date for each participant, we could alternatively start each person's time at zero and record the number of days until event/censor:

dd$startDay <- 0
dd$endDay <- as.numeric(dd$endDate - dd$startDate)

Next, we use tmerge() to transform the data into the format that would be needed for Cox regression with time-varying covariates. (Note: this is a minimal example that does not actually have any time-varying covariates.)

We do this two ways, to compare. 1) Using numeric days to event/censor; 2) Using dates.

Using days

ddTv <- tmerge(data1=dd, data2=dd, id=id, 
  tstart=startDay, tstop=endDay, event=event(endDay, death))
ddTv[1:6, ]

id death  startDate    endDate startDay endDay tstart  tstop event
 1  TRUE 2005-04-23 2014-11-29        0 3506.6      0 3506.6  TRUE
 2 FALSE 2010-08-13 2023-02-16        0 4570.6      0 4570.6 FALSE
 3 FALSE 2013-09-11 2023-06-22        0 3571.6      0 3571.6 FALSE
 4 FALSE 2007-08-31 2015-10-03        0 2955.1      0 2955.1 FALSE
 5  TRUE 2019-02-05 2027-01-27        0 2913.4      0 2913.4  TRUE
 6 FALSE 2002-05-14 2012-04-06        0 3615.2      0 3615.2 FALSE

Using dates

ddTvDate <- tmerge(data1=dd, data2=dd, id=id, 
  tstart=startDate, tstop=endDate, event=event(endDate, death))
ddTvDate[1:6, ]

id death  startDate    endDate startDay endDay     tstart      tstop event
 1  TRUE 2005-04-23 2014-11-29        0 3506.6 2005-04-23 2014-11-29  TRUE
 2 FALSE 2010-08-13 2023-02-16        0 4570.6 2010-08-13 2023-02-16 FALSE
 3 FALSE 2013-09-11 2023-06-22        0 3571.6 2013-09-11 2023-06-22 FALSE
 4 FALSE 2007-08-31 2015-10-03        0 2955.1 2007-08-31 2015-10-03 FALSE
 5  TRUE 2019-02-05 2027-01-27        0 2913.4 2019-02-05 2027-01-27  TRUE
 6 FALSE 2002-05-14 2012-04-06        0 3615.2 2002-05-14 2012-04-06 FALSE

Finally, these two ways of expressing the same data don't give the same regression results. We'll compare just the null model:

Using days

ddMod <- coxph(formula=
    Surv(time=tstart, time2=tstop, event=death) ~ 1, 
  data=ddTv)
ddMod

Null model
  log likelihood= -702.08 
  n= 1000

Using dates

ddModDate <- coxph(formula=
    Surv(time=as.numeric(tstart), time2=as.numeric(tstop), event=death) ~ 1, 
  data=ddTvDate)
ddModDate

Null model
  log likelihood= -681.85 
  n= 1000

Log-likelihoods are similar, but not the same.

Why are these not the same?

If you add covariates to the model then coefficients and p values between the two versions are again not the same.

Finally, if you don't use tmerge(), and go straight to coxph() on the original dataset, then both methods give you the same results. Both of these models

ddMod2 <- coxph(formula=
    Surv(time=endDay, event=death) ~ 1, 
  data=dd)
ddMod2
ddModDate2 <- coxph(formula=
    Surv(time=as.numeric(endDate - startDate), event=death) ~ 1, 
  data=dd)
ddModDate2

give the same results as ddMod above, the version using days.

Answer 1

Score: 0

Professor Terry M. Therneau (creator of the survival package) kindly gave me an answer, with permission to post here.

Paraphrasing:

Basically, the results are different because those are two entirely different models. Consider, for example, a participant who had an event on the 100th day that they were in the study, on January 1, 2010.

If I use calendar dates for my times, then the risk set for that event is everyone who was in the study on January 1, 2010.
If I use time since entry for my times, then the risk set for that event is everyone who was still in the study on their 100th day since entry.

Those are probably very different sets of people!
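
To make the two risk-set definitions concrete, here is a minimal base-R sketch on a made-up four-person cohort. The `toy` data frame and all of its dates are hypothetical and chosen only for illustration; participant 1 has the event on January 1, 2010, which is 100 days after their entry.

```r
# Hypothetical toy cohort (made up for illustration, not the question's data).
toy <- data.frame(
  id    = 1:4,
  start = as.Date(c("2009-09-23", "2009-12-01", "2005-01-01", "2010-06-01")),
  end   = as.Date(c("2010-01-01", "2012-01-01", "2005-03-01", "2011-06-01")),
  death = c(TRUE, FALSE, FALSE, FALSE)
)

# The single event: participant 1, on 2010-01-01, 100 days after entry.
event_date <- toy$end[toy$death]
event_day  <- as.numeric(event_date - toy$start[toy$death])

# Calendar time scale: at risk = already entered and not yet exited on that date.
risk_calendar <- toy$id[toy$start < event_date & toy$end >= event_date]

# Time-since-entry scale: at risk = followed for at least event_day days.
followup   <- as.numeric(toy$end - toy$start)
risk_entry <- toy$id[followup >= event_day]

event_day      # 100
risk_calendar  # 1 2
risk_entry     # 1 2 4
```

Participant 4 entered after the event date, so they are absent from the calendar-time risk set, yet they count on the time-since-entry scale because they were followed for more than 100 days.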

For almost every study, time since entry is the measure you want.
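
In practice, that suggests converting dates to days since entry before calling tmerge(). The following is a sketch under the assumption that the data look like the question's simulated `dd`; it regenerates that data so it can run on its own, and checks that the tmerge() version then agrees with a plain coxph() fit. As in the question, `death` is used as the event because each person has only one row; with real time-varying rows you would use the `event` column that tmerge() creates.

```r
library(survival)

# Regenerate the question's simulated data (same seed and same calls).
n <- 1000
set.seed(0)
dd <- data.frame(id = 1:n,
  death = sample(x = c(FALSE, TRUE), prob = c(8, 1), size = n, replace = TRUE),
  startDate = as.Date(runif(min = as.numeric(as.Date("2000-01-01")),
    max = as.numeric(as.Date("2019-12-31")), n = n), origin = "1970-01-01"))
dd$endDate <- as.Date(as.numeric(dd$startDate) +
    rnorm(mean = 3650, sd = 500, n = n), origin = "1970-01-01")

# Put everyone on the time-since-entry scale BEFORE tmerge().
dd$startDay <- 0
dd$endDay   <- as.numeric(dd$endDate - dd$startDate)

ddTv <- tmerge(data1 = dd, data2 = dd, id = id,
  tstart = startDay, tstop = endDay, event = event(endDay, death))

# Counting-process fit on the tmerge() output vs. a direct right-censored fit.
fitTv     <- coxph(Surv(tstart, tstop, death) ~ 1, data = ddTv)
fitDirect <- coxph(Surv(endDay, death) ~ 1, data = dd)

all.equal(fitTv$loglik, fitDirect$loglik)
```

If you really do want calendar time as the time scale, keeping the dates is legitimate, but it is a different model and should not be expected to match.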

Obvious once he points it out, opaque to me until then.
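
The risk-set explanation can also be checked by hand on the question's simulated data, without the survival package. This is a rough sketch: `null_loglik` is a hypothetical helper I wrote for illustration, and it assumes there are no tied event times, so the null-model log partial likelihood is just minus the sum of the log risk-set sizes over the events.

```r
# Regenerate the question's simulated data (same seed and same calls).
n <- 1000
set.seed(0)
dd <- data.frame(id = 1:n,
  death = sample(x = c(FALSE, TRUE), prob = c(8, 1), size = n, replace = TRUE),
  startDate = as.Date(runif(min = as.numeric(as.Date("2000-01-01")),
    max = as.numeric(as.Date("2019-12-31")), n = n), origin = "1970-01-01"))
dd$endDate <- as.Date(as.numeric(dd$startDate) +
    rnorm(mean = 3650, sd = 500, n = n), origin = "1970-01-01")

# Hypothetical helper: null-model log partial likelihood for (start, stop] data,
# assuming no tied event times. Each event contributes -log(risk-set size).
null_loglik <- function(start, stop, event) {
  event_times <- stop[event]
  risk_sizes <- vapply(event_times,
    function(t) sum(start < t & stop >= t), integer(1))
  -sum(log(risk_sizes))
}

start_cal <- as.numeric(dd$startDate)
stop_cal  <- as.numeric(dd$endDate)

# Time since entry: everyone starts at 0 and stops at their follow-up length.
ll_entry <- null_loglik(rep(0, n), stop_cal - start_cal, dd$death)

# Calendar time: everyone starts and stops on their actual dates.
ll_calendar <- null_loglik(start_cal, stop_cal, dd$death)

c(entry = ll_entry, calendar = ll_calendar)
```

If the explanation holds, the two values should differ in the same way as the log-likelihoods reported by ddMod and ddModDate in the question, because the only thing that changes between the two calls is who is counted as at risk at each event.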
