How to resolve an issue of large confidence intervals while running CoxPH analysis in R?
Question
I am running into an issue while performing CoxPH analysis using the following sample dataset:
Test_Dataset <- structure(list(Systemic.Tx...2.classification..Chemotherapy..PD1.monotherapy..PD.1.CTLA.4.combo..PD.1.chemo..targetted.Tx..targetted.chemo.combo..etc.
= c("Targetted Tx", "Targetted Tx", "Targetted Tx", "Targetted Tx", "Targetted Tx", "Targetted Tx", "Targetted Tx", "Targetted Tx",
"Targetted Tx", "Targetted Tx", "Targetted Tx", "Targetted Tx",
"Targetted Tx", "Targetted Tx", "Targetted Tx", "Targetted Tx",
"Targetted Tx", "Targetted Tx", "Targetted Tx", "Targetted Tx",
"Targetted Tx", "Targetted Tx", "Targetted Tx", "Targetted Tx",
"Targetted Tx", "Targetted Tx", "Targetted Tx", "Targetted Tx",
"Targetted Tx", "Targetted Tx", "Targetted/chemo combo", "Targetted Tx", "Targetted Tx", "Targetted Tx"), Time.on.systemic.Tx =
c("2.069815195", "2.332648871", "2.069815195", "1.215605749",
"2.661190965", "0.689938398", "1.839835729", "2.858316222",
"0.657084189", "2.529774127", "1.80698152", "3.482546201",
"2.891170431", "3.515400411", "2.431211499", "3.515400411",
"1.347022587", "5.519507187", "17.47843943", "26.90759754",
"6.176591376", "5.979466119", "8.246406571", "15.40862423",
"5.749486653", "6.242299795", "5.683778234", "6.636550308",
"10.15195072", "10.0862423", "18.52977413", "5.749486653",
"10.7761807", "6.965092402"), PFS2 = c(2.595482546, 2.37, 2.069815195,
1.412731006, 1.938398357, 0.657084189, 2.529774127, 3.219712526,
0.657084189, 2.529774127, 2.2, 3.482546201, 2.529774127, 3.712525667,
2.234086242, 3.778234086, 1.347022587, 5.55, 17.3798768, 30.32443532,
7.12936345, 7.09650924, 8.246406571, 15.24435318, 5.519507187,
5.749486653, 5.420944559, 6.636550308, 9.264887064, 10.02053388,
18.20123203, 6.110882957, 10.61190965, 6.866529774), PFS2_event = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1,
1, 1, 1, 0, 1, 1, 0, 1, 1, 1), Binarised_Time.on.Tx.2 = c("≤ 3.52 months",
"≤ 3.52 months", "≤ 3.52 months", "≤ 3.52 months", "≤ 3.52 months",
"≤ 3.52 months", "≤ 3.52 months", "≤ 3.52 months", "≤ 3.52 months",
"≤ 3.52 months", "≤ 3.52 months", "≤ 3.52 months", "≤ 3.52 months",
"≤ 3.52 months", "≤ 3.52 months", "≤ 3.52 months", "≤ 3.52 months",
"> 3.52 months", "> 3.52 months", "> 3.52 months", "> 3.52 months",
"> 3.52 months", "> 3.52 months", "> 3.52 months", "> 3.52 months",
"> 3.52 months", "> 3.52 months", "> 3.52 months", "> 3.52 months",
"> 3.52 months", "> 3.52 months", "> 3.52 months", "> 3.52 months",
"> 3.52 months")), row.names = c(NA, -34L), class = "data.frame")
And here is the code I am using for this analysis:
library(survival)
fit1 <- coxph(Surv(PFS2, PFS2_event) ~ Binarised_Time.on.Tx.2, data = Test_Dataset)
summary(fit1)
I receive the following warning after running this code:
> Warning message: In coxph.fit(X, Y, istrat, offset, init, control, weights = weights, : Loglik converged before variable 1 ; coefficient may be infinite.
More importantly, the results look wrong: the confidence interval runs from 0 to Inf, and the coefficient and p-value are both very large. I ran the same analysis for Overall Survival on the same dataset without any issues. Any suggestions as to what might be driving this issue with my PFS2 values?
Answer 1
Score: 1
This is a variant of the complete separation problem, which you can start to read about (e.g.) here.
These aren't really incorrect estimates, they're the attempt to show infinite estimates. The Wald estimates of the standard errors fail in this case (this is called the Hauck-Donner effect).
Some possible solutions:
- you can still use anova.coxph to compare the fit to the fit of a null model and get a valid p-value that way
- consider not dichotomizing your predictor ...
- fit a regularized model, e.g. using the glmnet package with a ridge penalty (alpha = 0) and a small penalty
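As a rough sketch of the first suggestion, the block below rebuilds only the PFS2 columns from the question and compares the Wald output with the likelihood-ratio test. The data frame name `d`, the column name `group`, and the ASCII level labels are stand-ins introduced here, not names from the original post; the `coxph()` call is expected to reproduce the convergence warning.

```r
library(survival)

# PFS2 times and event indicators copied from the question's dataset;
# the first 17 rows are the "<= 3.52 months" group, the last 17 the "> 3.52 months" group.
d <- data.frame(
  PFS2 = c(2.595482546, 2.37, 2.069815195, 1.412731006, 1.938398357,
           0.657084189, 2.529774127, 3.219712526, 0.657084189, 2.529774127,
           2.2, 3.482546201, 2.529774127, 3.712525667, 2.234086242,
           3.778234086, 1.347022587, 5.55, 17.3798768, 30.32443532,
           7.12936345, 7.09650924, 8.246406571, 15.24435318, 5.519507187,
           5.749486653, 5.420944559, 6.636550308, 9.264887064, 10.02053388,
           18.20123203, 6.110882957, 10.61190965, 6.866529774),
  PFS2_event = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
                 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1),
  # ASCII labels used as stand-ins for the original factor levels
  group = factor(rep(c("<= 3.52 months", "> 3.52 months"), each = 17))
)

# Fitting the model triggers the "coefficient may be infinite" warning
fit1 <- coxph(Surv(PFS2, PFS2_event) ~ group, data = d)

# Wald-based table: the standard error and confidence interval are not usable here
print(summary(fit1)$coefficients)

# Likelihood-ratio test against the null model, which remains valid
print(anova(fit1))
lrt <- summary(fit1)$logtest  # named vector: test, df, pvalue
print(lrt)
```

The likelihood-ratio statistic depends only on the maximized log-likelihoods, which converge to a finite limit even when the coefficient itself diverges, so it does not suffer from the Wald breakdown.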
Easiest to see by plotting the data (using a Kaplan-Meier estimate):
library(survival)
library(ggfortify)
fit2 <- survfit(Surv(PFS2, PFS2_event) ~ Binarised_Time.on.Tx.2, data = Test_Dataset)
autoplot(fit2)
All of the individuals in the "≤3.52" stratum die (fail) or are censored before the first individual in the other stratum dies ...
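The same separation can be checked numerically without a plot. The sketch below uses the PFS2 values and event indicators from the question; the variable names (`pfs`, `event`, `short`, `last_short`, `first_event_long`) are hypothetical helpers introduced for illustration.

```r
# PFS2 times and event indicators copied from the question's dataset
pfs <- c(2.595482546, 2.37, 2.069815195, 1.412731006, 1.938398357,
         0.657084189, 2.529774127, 3.219712526, 0.657084189, 2.529774127,
         2.2, 3.482546201, 2.529774127, 3.712525667, 2.234086242,
         3.778234086, 1.347022587, 5.55, 17.3798768, 30.32443532,
         7.12936345, 7.09650924, 8.246406571, 15.24435318, 5.519507187,
         5.749486653, 5.420944559, 6.636550308, 9.264887064, 10.02053388,
         18.20123203, 6.110882957, 10.61190965, 6.866529774)
event <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
           0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1)

# TRUE for the 17 rows in the "<= 3.52 months" group, FALSE for the rest
short <- rep(c(TRUE, FALSE), each = 17)

# Last follow-up time (event or censoring) in the short group
last_short <- max(pfs[short])

# First observed event in the long group
first_event_long <- min(pfs[!short & event == 1])

# If this is TRUE, no event in the long group ever shares a risk set with
# anyone from the short group, so the partial likelihood keeps increasing
# as the coefficient moves toward infinity
separated <- last_short < first_event_long
print(c(last_short = last_short, first_event_long = first_event_long))
print(separated)
```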
We can plot the fitted Cox model too (with autoplot(survfit(fit1))), although it's less obvious what's going on ...