Add a column indicating whether at least one matching value exists across multiple columns

Question

I am attempting to select a number of ICD-10 medical codes from a diagnosis variable to retrieve a set of specific diagnoses within a dataset, but I am having a very hard time doing so succinctly.

For context, there are over 50 different diagnostic instances for every participant ID (PID), each coded as Disease_code_1, Disease_code_2, Disease_code_3, etc., with corresponding dates as shown below.

  df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
                  Disease_code_1 = c('I802', 'G200', 'I802', NA, 'H356'),
                  Disease_code_2 = c('A071', NA, 'G20', NA, 'I802'),
                  Disease_code_3 = c('H250', NA, NA, NA, NA),
                  Date_of_diagnosis_1 = c('12/06/1997', '13/06/1997', '14/02/2003', NA, '18/03/2005'),
                  Date_of_diagnosis_2 = c('12/06/1998', NA, '18/09/2001', NA, '12/07/1993'),
                  Date_of_diagnosis_3 = c('17/09/2010', NA, NA, NA, NA))

      ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1 Date_of_diagnosis_2 Date_of_diagnosis_3
  1 1001           I802           A071           H250          12/06/1997          12/06/1998          17/09/2010
  2 1002           G200           <NA>           <NA>          13/06/1997                <NA>                <NA>
  3 1003           I802            G20           <NA>          14/02/2003          18/09/2001                <NA>
  4 1004           <NA>           <NA>           <NA>                <NA>                <NA>                <NA>
  5 1005           H356           I802           <NA>          18/03/2005          12/07/1993                <NA>

As can be seen, I have several I802 codes within Disease_code_1, but also within Disease_code_2, that I want to include in my analysis.

However, when I combine the codes I want to select using a simple ifelse function, I cannot use an 'or' operator, as I get 0 participants:

  data$diagnosis <- with(data, ifelse((data$disease_code_1 == "I802||G200||H356"), 1, 0))
  data$diagnosis[is.na(data$diagnosis)] <- 0
  sum(data$diagnosis)

I am left with a sum of 0. The only way I can get a nonzero count so far is to write every comparison out in full. Although there are only 3 instances and 3 disease codes in this example, the real data has over 50 instances and approximately 12 diagnoses to check in each, so this line of code becomes exceptionally long and I would like to make it more succinct.

  data$diagnosis <- with(data, ifelse(((data$disease_code_1 == "I802") | (data$disease_code_1 == "G200") | (data$disease_code_1 == "H356") |
                                       (data$disease_code_2 == "I802") | (data$disease_code_2 == "G200") | (data$disease_code_2 == "H356") |
                                       (data$disease_code_3 == "I802") | (data$disease_code_3 == "G200") | (data$disease_code_3 == "H356")), 1, 0))
  data$diagnosis[is.na(data$diagnosis)] <- 0
  sum(data$diagnosis)

This method provides an answer; however, once the line reaches a certain length it stops running and leaves a + continuation prompt in the console.

Is there a way I can make this much more succinct?

Answer 1

Score: 3

A few notes regarding your code:

  1. When you use with, you no longer need the data$ prefix; with(data, ifelse(disease_code_1 == "I802||G200||H356", 1, 0)) is enough syntactically (the comparison itself is still wrong, see the next note).
  2. We cannot use == to compare against a set of strings; "I802||G200||H356" is treated as one literal string.
  3. In your code you have lower-case disease_code, but in your data you have upper-case Disease_code. R column names are case-sensitive.
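
To make the second note concrete, here is a minimal base R sketch. The vectors `codes` and `target` are made-up illustrations, not the asker's data. It shows that `==` compares each element against the whole literal string, while `%in%` tests membership in a set:

```r
# Hypothetical example vector; not taken from the question's data frame.
codes  <- c("I802", "G200", "A071", NA)
target <- c("I802", "G200", "H356")

# `==` compares every element with the single string "I802||G200||H356",
# so no element matches, and the missing value stays NA.
eq_result <- codes == "I802||G200||H356"

# `%in%` checks membership in `target` and returns FALSE (never NA) for NA.
in_result <- codes %in% target

print(eq_result)
print(in_result)
```

Because `%in%` never returns NA, the separate step that overwrites NA with 0 should not be needed with this approach.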


Solution

To check the existence of a set of strings, we can use the %in% operator, and we can use apply to iterate over the rows of the data frame.

  data$diagnosis <- apply(
    data[startsWith(colnames(data), "Disease_code_")], 1,
    \(x) as.integer(any(x %in% c("I802", "G200", "H356")))
  )

      ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1 Date_of_diagnosis_2 Date_of_diagnosis_3 diagnosis
  1 1001           I802           A071           H250          12/06/1997          12/06/1998          17/09/2010         1
  2 1002           G200           <NA>           <NA>          13/06/1997                <NA>                <NA>         1
  3 1003           I802            G20           <NA>          14/02/2003          18/09/2001                <NA>         1
  4 1004           <NA>           <NA>           <NA>                <NA>                <NA>                <NA>         0
  5 1005           H356           I802           <NA>          18/03/2005          12/07/1993                <NA>         1

  sum(data$diagnosis)
  [1] 4
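
As a possible alternative to the row-wise apply(), the same flag can be built column by column, which avoids converting the data frame to a character matrix. This is only a sketch: it rebuilds the question's example data as `df` (date columns omitted for brevity), and `target`, `code_cols`, and `hits` are names introduced here for illustration.

```r
# Example data from the question (date columns left out for brevity).
df <- data.frame(
  ID = c(1001, 1002, 1003, 1004, 1005),
  Disease_code_1 = c("I802", "G200", "I802", NA, "H356"),
  Disease_code_2 = c("A071", NA, "G20", NA, "I802"),
  Disease_code_3 = c("H250", NA, NA, NA, NA)
)

target    <- c("I802", "G200", "H356")
code_cols <- startsWith(names(df), "Disease_code_")

# One logical vector per code column: TRUE where the code is in `target`.
hits <- lapply(df[code_cols], function(col) col %in% target)

# Combine the columns with element-wise OR, then store the flag as 0/1.
df$diagnosis <- as.integer(Reduce(`|`, hits))

sum(df$diagnosis)
```

With 50+ code columns this keeps the work vectorised per column rather than calling a function once per row; whether that matters in practice depends on the size of the data.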

Answer 2

Score: 3

You can use dplyr::if_any() with an assist from tidyr::replace_na(). As suggested by @jared_mamrot, you can use + to coerce the resulting logical vector to integer.

  library(dplyr)
  library(tidyr)

  data %>%
    mutate(diagnosis = replace_na(
      +if_any(starts_with("disease_code"), \(x) x %in% c("I802", "G200", "H356")),
      0
    ))

Alternatively, omit the + and keep the result as a logical vector. You'll still be able to do arithmetic on it, and you can use it directly in logical tests (e.g., ifelse(diagnosis, ...) rather than ifelse(diagnosis == 1, ...)).
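
A small base R sketch of that last point, using a hypothetical `diagnosis` vector that mirrors the five example rows (the names `n_cases`, `as_int`, and `labels` are introduced here purely for illustration):

```r
# Hypothetical logical flag for the five example participants.
diagnosis <- c(TRUE, TRUE, TRUE, FALSE, TRUE)

# TRUE counts as 1 in arithmetic, so sum() gives the number of cases.
n_cases <- sum(diagnosis)

# Unary plus coerces a logical vector to integer 0/1.
as_int <- +diagnosis

# A logical vector can be used directly as the test; no `== 1` is needed.
labels <- ifelse(diagnosis, "case", "control")

print(n_cases)
print(as_int)
print(labels)
```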

Answer 3

Score: 1

The way I would approach this is to convert your original data frame into a long format and then select the rows from there:

  library(tidyr)
  library(dplyr)

  df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
                  Disease_code_1 = c('I802', 'G200', 'I802', NA, 'H356'),
                  Disease_code_2 = c('A071', NA, 'G20', NA, 'I802'),
                  Disease_code_3 = c('H250', NA, NA, NA, NA),
                  Date_of_diagnosis_1 = c('12/06/1997', '13/06/1997', '14/02/2003', NA, '18/03/2005'),
                  Date_of_diagnosis_2 = c('12/06/1998', NA, '18/09/2001', NA, '12/07/1993'),
                  Date_of_diagnosis_3 = c('17/09/2010', NA, NA, NA, NA))

  # convert to long format, all disease codes in one column
  dflong <- pivot_longer(df, -ID, names_sep = "_(?=\\d)", names_to = c(".value", "count"), values_to = "value")

  # find the rows that match the code(s) of interest
  hascondition <- which(dflong$Disease_code %in% c("I802", "G200", "H356"))

  # get a list of patient IDs
  patients <- unique(dflong$ID[hascondition])
  length(patients)
  #[1] 4

  # filter/subset the original data frame
  subset <- df %>% filter(ID %in% patients)
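
If tidyr is not available, the same long-format idea can be sketched in base R by stacking the code columns into one vector. This is an illustration under stated assumptions rather than a drop-in replacement: only the ID and code columns of the example are rebuilt, and `target`, `code_cols`, `long`, and `matched` are names introduced here.

```r
# Example data from the question (date columns left out for brevity).
df <- data.frame(
  ID = c(1001, 1002, 1003, 1004, 1005),
  Disease_code_1 = c("I802", "G200", "I802", NA, "H356"),
  Disease_code_2 = c("A071", NA, "G20", NA, "I802"),
  Disease_code_3 = c("H250", NA, NA, NA, NA)
)

target    <- c("I802", "G200", "H356")
code_cols <- grep("^Disease_code_", names(df), value = TRUE)

# Stack the code columns into one long column. unlist() walks the data frame
# column by column, so the IDs are repeated once per code column to stay aligned.
long <- data.frame(
  ID = rep(df$ID, times = length(code_cols)),
  Disease_code = unlist(df[code_cols], use.names = FALSE)
)

# IDs with at least one code of interest, then the matching original rows.
patients <- unique(long$ID[long$Disease_code %in% target])
matched  <- df[df$ID %in% patients, ]

length(patients)
```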

Answer 4

Score: 0

In data.table:

  library(data.table)

  setDT(df)[, diagnosis := +any(.SD %in% c("I802", "G200", "H356")),
            .(ID),  ## or seq_len(NROW(df))
            .SDcols = names(df)[grep("Disease_code", names(df))]][]
  #>      ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1
  #> 1: 1001           I802           A071           H250          12/06/1997
  #> 2: 1002           G200           <NA>           <NA>          13/06/1997
  #> 3: 1003           I802            G20           <NA>          14/02/2003
  #> 4: 1004           <NA>           <NA>           <NA>                <NA>
  #> 5: 1005           H356           I802           <NA>          18/03/2005
  #>    Date_of_diagnosis_2 Date_of_diagnosis_3 diagnosis
  #> 1:          12/06/1998          17/09/2010         1
  #> 2:                <NA>                <NA>         1
  #> 3:          18/09/2001                <NA>         1
  #> 4:                <NA>                <NA>         0
  #> 5:          12/07/1993                <NA>         1

huangapple
  • Posted on May 25, 2023, 10:37:20
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76328545.html