Why does the survival probability of the survival package return 0% at the end of the time horizon when there are survivors in the dataset?
Question
I've just started using the survival and survminer packages in R and am trying to understand their output. In the code below I create a data frame with the first 12 rows of my actual dataset, as representative of the issue. In this representative data:
- ID = unique identifier for each element
- time = survival time for the element in months where value > 0 means death (the month that death occurs) and value = 0 means no death (right censored) during the study period
- status = the element's censoring status where 1=censored and 2=dead
- node = one of the variables associated with each element where I try to assess its association with the probability of death
Running length(which(testDF$status == 2))/nrow(testDF) shows a death rate of 66.67% with this data, but the survival probability curves shown in the image below end at 0%. Should they not be ending at 66.67% at least for the average of all the data? What am I doing wrong here or am I misinterpreting survival probability?
Code:
library(ggplot2)
library(survival)
library(survminer)
testDF <- data.frame(
ID = 1:12,
time = c(0,34,0,12,12,21,0,0,39,11,13,26),
status = c(1,2,1,2,2,2,1,1,2,2,2,2),
node = c("C","C","B","A","C","C","B","C","B","C","A","B")
)
fit <- survfit(Surv(time, status) ~ node, data = testDF)
ggsurvplot(fit,
pval = TRUE,
conf.int = TRUE,
linetype = "strata",
surv.median.line = "hv",
ggtheme = theme_bw()
)
# percentage of deaths
length(which(testDF$status == 2))/nrow(testDF)
Answer 1
Score: 0
Coding the censored observations (no death, i.e. the survivors) with 0 in the "time" column was incorrect, as Edward points out in his comments. I have now recoded those survivor observations with the length of the study, 40 months, and switched status to the 0 = censored, 1 = death convention. I also re-ran the plot without confidence intervals to make the result clearer.
testDF <- data.frame(
ID = 1:12,
time = c(40,34,40,12,12,21,40,40,39,11,13,26), # 40 month study window (0's for no death changed to 40)
status = c(0,1,0,1,1,1,0,0,1,1,1,1), # 0 = censored, 1 = death
node = c("C","C","B","A","C","C","B","C","B","C","A","B")
)
# survival rates: total = 33.3%, node A = 0%, node B = 50%, node C = 33.3%
length(which(testDF$status == 0))/nrow(testDF)
length(which(testDF$status == 0 & testDF$node == "A"))/length(which(testDF$node == "A"))
length(which(testDF$status == 0 & testDF$node == "B"))/length(which(testDF$node == "B"))
length(which(testDF$status == 0 & testDF$node == "C"))/length(which(testDF$node == "C"))
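To see why the recoding matters, here is a minimal sketch that fits the Kaplan-Meier estimator to both codings and compares the estimated survival at month 40. The names zero_coded, recoded and final_surv are made up for this illustration. A subject censored at time 0 leaves the risk set before any death is observed, so the zero-coded fit only ever "sees" the deaths.

```r
library(survival)

# Coding from the question: survivors have time = 0; status 1 = censored, 2 = dead
zero_coded <- data.frame(
  time   = c(0, 34, 0, 12, 12, 21, 0, 0, 39, 11, 13, 26),
  status = c(1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2),
  node   = c("C", "C", "B", "A", "C", "C", "B", "C", "B", "C", "A", "B")
)

# Revised coding: survivors censored at month 40; status 0 = censored, 1 = death
recoded <- transform(
  zero_coded,
  time   = ifelse(status == 1, 40, time),
  status = as.integer(status == 2)
)

# Hypothetical helper: Kaplan-Meier survival estimate per node at month 40.
# extend = TRUE carries each curve forward to month 40 even when the last
# observed time in a stratum is earlier.
final_surv <- function(df) {
  fit <- survfit(Surv(time, status) ~ node, data = df)
  s <- summary(fit, times = 40, extend = TRUE)
  setNames(s$surv, as.character(s$strata))
}

km_zero_coded <- final_surv(zero_coded)  # 0 in every stratum
km_recoded    <- final_surv(recoded)     # A = 0, B = 0.5, C = 1/3

# Raw share of survivors per node, for comparison with km_recoded
raw_share <- tapply(recoded$status == 0, recoded$node, mean)

print(km_zero_coded)
print(km_recoded)
print(raw_share)
```

In this particular data set the recoded Kaplan-Meier estimates coincide with the raw survivor shares because every censoring happens at month 40, after the last death. With censoring spread across the study period the two would generally differ.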
Plot running the above revised DF:
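The plotting call for the revised data is not shown in the answer. A sketch of what it might look like, assuming it is the ggsurvplot call from the question with conf.int switched off and the data passed explicitly:

```r
library(survival)

testDF <- data.frame(
  ID     = 1:12,
  time   = c(40, 34, 40, 12, 12, 21, 40, 40, 39, 11, 13, 26),
  status = c(0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1),  # 0 = censored, 1 = death
  node   = c("C", "C", "B", "A", "C", "C", "B", "C", "B", "C", "A", "B")
)

fit <- survfit(Surv(time, status) ~ node, data = testDF)

# Assumption: same arguments as in the question, minus the confidence bands.
# The plot is only drawn when survminer is installed.
if (requireNamespace("survminer", quietly = TRUE)) {
  p <- survminer::ggsurvplot(
    fit,
    data = testDF,
    pval = TRUE,
    conf.int = FALSE,
    linetype = "strata",
    surv.median.line = "hv",
    ggtheme = ggplot2::theme_bw()
  )
  print(p)
}
```

With this coding the node B and node C curves should level off above zero instead of dropping to 0%, since the survivors now stay in the risk set until month 40.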