Add a column indicating whether at least one matching value exists across multiple columns
Question
I am trying to select a number of ICD-10 medical codes from a diagnosis variable to retrieve a set of specific diagnoses within a dataset, but I am having a hard time doing so succinctly.
For context, there are over 50 different diagnostic instances for every participant ID (PID), each coded as Disease_code_1, Disease_code_2, Disease_code_3, etc., with corresponding dates as shown below.
df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
Disease_code_1 = c('I802', 'G200','I802',NA, 'H356'),
Disease_code_2 = c('A071',NA,'G20',NA,'I802'),
Disease_code_3 = c('H250', NA,NA,NA,NA),
Date_of_diagnosis_1 = c('12/06/1997','13/06/1997','14/02/2003',NA,'18/03/2005'),
Date_of_diagnosis_2 = c('12/06/1998',NA,'18/09/2001',NA,'12/07/1993'),
Date_of_diagnosis_3 = c('17/09/2010',NA,NA,NA,NA))
ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1 Date_of_diagnosis_2 Date_of_diagnosis_3
1 1001 I802 A071 H250 12/06/1997 12/06/1998 17/09/2010
2 1002 G200 <NA> <NA> 13/06/1997 <NA> <NA>
3 1003 I802 G20 <NA> 14/02/2003 18/09/2001 <NA>
4 1004 <NA> <NA> <NA> <NA> <NA> <NA>
5 1005 H356 I802 <NA> 18/03/2005 12/07/1993 <NA>
As can be seen, I have several I802 codes within disease_code_1, but also within disease_code_2, that I want to include in my analysis.
However, when I combine the codes I want to select in a simple ifelse call, I cannot use an 'or' operator, as I get 0 participants:
data$diagnosis<- with(data, ifelse((data$disease_code_1 == "I802||G200||H356"),1,0))
data$diagnosis[is.na(data$diagnosis)] <- 0
sum(data$diagnosis)
I am left with a sum of 0. The only way I can get a number so far is to write every comparison out longhand. Although there are only 3 instances and 3 disease codes in this example, the real data has over 50 instances and approximately 12 diagnoses that need to be counted for each, so this line of code becomes exceptionally long and I would like to make it more succinct.
data$diagnosis <- with(data, ifelse(((data$disease_code_1 == "I802")|(data$disease_code_1 == "G200")|(data$disease_code_1 == "H356")|(data$disease_code_2 == "I802")|(data$disease_code_2 == "G200")|(data$disease_code_2 == "H356")|(data$disease_code_3 == "I802")|(data$disease_code_3 == "G200")|(data$disease_code_3 == "H356")), 1, 0))
data$diagnosis[is.na(data$diagnosis)] <- 0
sum(data$diagnosis)
This method does provide an answer; however, once the line reaches a certain length it stops running, and the command line shows a + prompt instead.
Is there a way I can make this much more succinct?
Answer 1
Score: 3
A few notes regarding your code:
- When you use with, you no longer need the data$ syntax; with(data, ifelse(disease_code_1 == "I802||G200||H356", 1, 0)) is enough.
- We cannot use == to compare against a set of strings.
- In your code you have lower-case disease_code, but in your data you have upper-case Disease_code.
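To illustrate the second note, here is a minimal sketch comparing == with %in% on a small vector. The names x, codes, eq_result and in_result are made up for this example and are not part of the original data.

```r
# Hypothetical example vector; not taken from the asker's data frame.
codes <- c("I802", "G200", "H356")
x <- c("I802", "G200", "A071", NA)

# `==` compares every element with the single literal string
# "I802||G200||H356". No diagnosis code equals that string, so nothing
# matches, and the missing value stays NA.
eq_result <- x == "I802||G200||H356"

# `%in%` tests membership in the set of codes and returns FALSE
# (not NA) for missing values.
in_result <- x %in% codes

print(eq_result)
print(in_result)
```

Because %in% never returns NA, the separate step that replaces NA with 0 is no longer needed with this approach.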
Solution
To check for the presence of any of a set of strings, we can use the %in% operator, and we can use apply to iterate over the rows of the data frame.
data$diagnosis <- apply(data[startsWith(colnames(data), "Disease_code_")], 1, \(x) as.integer(any(x %in% c("I802", "G200", "H356"))))
ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1 Date_of_diagnosis_2 Date_of_diagnosis_3 diagnosis
1 1001 I802 A071 H250 12/06/1997 12/06/1998 17/09/2010 1
2 1002 G200 <NA> <NA> 13/06/1997 <NA> <NA> 1
3 1003 I802 G20 <NA> 14/02/2003 18/09/2001 <NA> 1
4 1004 <NA> <NA> <NA> <NA> <NA> <NA> 0
5 1005 H356 I802 <NA> 18/03/2005 12/07/1993 <NA> 1
sum(data$diagnosis)
[1] 4
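As a possible alternative to the row-wise apply call, the same flag can be built column by column in base R, which may scale better to 50+ code columns. This is only a sketch: it rebuilds the code columns of the example data under the name df, and the helper names codes, code_cols and hits are assumptions introduced here.

```r
# Rebuild the diagnosis-code columns of the example data.
df <- data.frame(
  ID = c(1001, 1002, 1003, 1004, 1005),
  Disease_code_1 = c("I802", "G200", "I802", NA, "H356"),
  Disease_code_2 = c("A071", NA, "G20", NA, "I802"),
  Disease_code_3 = c("H250", NA, NA, NA, NA)
)

codes <- c("I802", "G200", "H356")
code_cols <- startsWith(names(df), "Disease_code_")

# One logical column per diagnosis column: TRUE where the code is in the set.
hits <- sapply(df[code_cols], function(col) col %in% codes)

# A row is flagged when at least one of its columns matched.
df$diagnosis <- as.integer(rowSums(hits) > 0)

sum(df$diagnosis)
```

Each column is checked in a single vectorised %in% call, so the function is called once per column rather than once per row.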
Answer 2
Score: 3
You can use dplyr::if_any() with an assist from tidyr::replace_na(). As suggested by @jared_mamrot, you can use + to coerce the resulting logical vector to integer.
library(dplyr)
library(tidyr)
data %>%
  mutate(diagnosis = replace_na(
    +if_any(starts_with("disease_code"), \(x) x %in% c("I802", "G200", "H356")),
    0
  ))
Alternatively, omit the + and just keep it as a logical vector. You'll still be able to do arithmetic operations, plus use it directly in logical tests (e.g., ifelse(diagnosis, ...) rather than ifelse(diagnosis == 1, ...)).
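A small sketch of what keeping the flag as a logical vector allows. The diagnosis vector is typed in by hand to mirror the expected result for the five example participants, and the names n_cases, prop_cases and label are illustrative only.

```r
# Hand-written flag mirroring the five example participants.
diagnosis <- c(TRUE, TRUE, TRUE, FALSE, TRUE)

# Arithmetic treats TRUE as 1 and FALSE as 0.
n_cases <- sum(diagnosis)
prop_cases <- mean(diagnosis)

# The logical vector can be used directly as the test in ifelse().
label <- ifelse(diagnosis, "case", "control")

print(n_cases)
print(prop_cases)
print(label)
```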
Answer 3
Score: 1
The way I would approach this is to convert your original data frame into a long format and then select the rows from there:
library(tidyr)
library(dplyr)
df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
Disease_code_1 = c('I802', 'G200','I802',NA, 'H356'),
Disease_code_2 = c('A071',NA,'G20',NA,'I802'),
Disease_code_3 = c('H250', NA,NA,NA,NA),
Date_of_diagnosis_1 = c('12/06/1997','13/06/1997','14/02/2003',NA,'18/03/2005'),
Date_of_diagnosis_2 = c('12/06/1998',NA,'18/09/2001',NA,'12/07/1993'),
Date_of_diagnosis_3 = c('17/09/2010',NA,NA,NA,NA))
# convert to long format, with all disease codes in one column
dflong <- pivot_longer(df, -ID, names_sep = "_(?=\\d)", names_to=c(".value", "count"), values_to= "value")
# find the rows that match the code(s) of interest
hascondition <- which(dflong$Disease_code %in% c("I802", "G200", "H356"))
#get a list of patient IDs
patients <- unique(dflong$ID[hascondition])
length(patients)
#[1] 4
#filter/subset the original data frame
subset <- df %>% filter(ID %in% patients)
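The names_sep pattern "_(?=\\d)" splits each column name at the underscore that comes directly before a digit. The sketch below uses base R's strsplit with perl = TRUE to show that split on two of the column names. It is an approximation for illustration only, since tidyr applies the pattern with its own regex engine, and the names col_names and parts are assumptions.

```r
col_names <- c("Disease_code_1", "Date_of_diagnosis_2")

# Split at the underscore followed by a digit; the lookahead (?=\\d)
# keeps the digit itself out of the separator.
parts <- strsplit(col_names, "_(?=\\d)", perl = TRUE)

print(parts)
```

The first piece of each name becomes the output column (through ".value") and the second piece goes into the count column, which is why dflong ends up with a single Disease_code column.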
Answer 4
Score: 0
In data.table:
library(data.table)
setDT(df)[ , diagnosis := +any(.SD %in% c("I802", "G200", "H356")),
           .(ID), ## or seq_len(NROW(df))
           .SDcols = names(df)[grep("Disease_code", names(df))]][]
#> ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1
#> 1: 1001 I802 A071 H250 12/06/1997
#> 2: 1002 G200 <NA> <NA> 13/06/1997
#> 3: 1003 I802 G20 <NA> 14/02/2003
#> 4: 1004 <NA> <NA> <NA> <NA>
#> 5: 1005 H356 I802 <NA> 18/03/2005
#> Date_of_diagnosis_2 Date_of_diagnosis_3 diagnosis
#> 1: 12/06/1998 17/09/2010 1
#> 2: <NA> <NA> 1
#> 3: 18/09/2001 <NA> 1
#> 4: <NA> <NA> 0
#> 5: 12/07/1993 <NA> 1