Add a column indicating whether at least one matching value exists across multiple columns

Question

I am trying to select several ICD-10 codes from a set of diagnosis variables in order to flag participants with specific diagnoses, but I am struggling to do so succinctly.

For context, each participant ID (PID) has over 50 diagnosis instances, coded as Disease_code_1, Disease_code_2, Disease_code_3, etc., with corresponding dates, as shown below.

df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
                Disease_code_1 = c('I802', 'G200','I802',NA, 'H356'),
                Disease_code_2 = c('A071',NA,'G20',NA,'I802'),
                Disease_code_3 = c('H250', NA,NA,NA,NA),
                Date_of_diagnosis_1 = c('12/06/1997','13/06/1997','14/02/2003',NA,'18/03/2005'),
                Date_of_diagnosis_2 = c('12/06/1998',NA,'18/09/2001',NA,'12/07/1993'),
                Date_of_diagnosis_3 = c('17/09/2010',NA,NA,NA,NA))

    ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1 Date_of_diagnosis_2 Date_of_diagnosis_3
1 1001           I802           A071           H250          12/06/1997          12/06/1998          17/09/2010
2 1002           G200           <NA>           <NA>          13/06/1997                <NA>                <NA>
3 1003           I802            G20           <NA>          14/02/2003          18/09/2001                <NA>
4 1004           <NA>           <NA>           <NA>                <NA>                <NA>                <NA>
5 1005           H356           I802           <NA>          18/03/2005          12/07/1993                <NA>

As can be seen, I802 appears several times in Disease_code_1 but also in Disease_code_2, and I want to include all of these occurrences in my analysis.

However, when I try to combine the codes I want to select in a simple ifelse call, the 'or' operator does not work as I expect and I get 0 participants:

data$diagnosis <- with(data, ifelse((data$disease_code_1 == "I802||G200||H356"),1,0))
data$diagnosis[is.na(data$diagnosis)] <- 0
sum(data$diagnosis)

I am left with a sum of 0. The only way I have obtained a count so far is by writing out every comparison in full. This example has only 3 instances and 3 disease codes, but the real data has over 50 instances and approximately 12 diagnosis codes to check in each, so the line becomes exceptionally long and I would like to make it more succinct.

data$diagnosis <- with(data, ifelse(((data$disease_code_1 == "I802")|(data$disease_code_1 == "G200")|(data$disease_code_1 == "H356")|(data$disease_code_2 == "I802")|(data$disease_code_2 == "G200")|(data$disease_code_2 == "H356")|(data$disease_code_3 == "I802")|(data$disease_code_3 == "G200")|(data$disease_code_3 == "H356")), 1, 0))
data$diagnosis[is.na(data$diagnosis)] <- 0
sum(data$diagnosis)

This method gives an answer, but once the line reaches a certain length it stops running and the console shows a + continuation prompt instead.

Is there a way I can make this much more succinct?

Answer 1

Score: 3

A few notes regarding your code:

  1. When you use with, you no longer need the data$ prefix; with(data, ifelse(disease_code_1 == "I802||G200||H356", 1, 0)) is enough.
  2. We cannot use == to compare against a set of strings: "I802||G200||H356" is a single literal string, not an "or" of three codes, so no value ever equals it.
  3. Your code uses lower-case disease_code, but the columns in your data are named with upper-case Disease_code.
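
To illustrate note 2, here is a minimal sketch. The vector `codes` and the set `targets` are made-up names used only for this illustration. It shows that `==` compares each element against the whole literal string, whereas `%in%` tests membership in a set and returns FALSE rather than NA for missing values.

```r
codes   <- c("I802", "G200", NA, "A071")
targets <- c("I802", "G200", "H356")

# `==` compares every element with the single string "I802||G200||H356"
eq_result <- codes == "I802||G200||H356"

# `%in%` tests set membership; NA counts as "not a member"
in_result <- codes %in% targets

print(eq_result)  # FALSE FALSE    NA FALSE
print(in_result)  #  TRUE  TRUE FALSE FALSE
```

Because `%in%` never returns NA, the separate step that replaces NA with 0 is not needed when it is used.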

Solution

To check whether a value belongs to a set of strings, we can use the %in% operator, and we can use apply to iterate over the rows of the data frame.

data$diagnosis <- apply(data[startsWith(colnames(data), "Disease_code_")], 1, \(x) as.integer(any(x %in% c("I802", "G200", "H356"))))

    ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1 Date_of_diagnosis_2 Date_of_diagnosis_3 diagnosis
1 1001           I802           A071           H250          12/06/1997          12/06/1998          17/09/2010         1
2 1002           G200           <NA>           <NA>          13/06/1997                <NA>                <NA>         1
3 1003           I802            G20           <NA>          14/02/2003          18/09/2001                <NA>         1
4 1004           <NA>           <NA>           <NA>                <NA>                <NA>                <NA>         0
5 1005           H356           I802           <NA>          18/03/2005          12/07/1993                <NA>         1
sum(data$diagnosis)
[1] 4
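
For wide data with 50+ code columns, a column-wise variant that avoids the row-by-row apply might look like the sketch below. This is an alternative formulation under stated assumptions, not part of the answer above: it rebuilds the code columns of the question's `df`, and the names `targets`, `code_cols` and `hits` are illustrative.

```r
df <- data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
                 Disease_code_1 = c("I802", "G200", "I802", NA, "H356"),
                 Disease_code_2 = c("A071", NA, "G20", NA, "I802"),
                 Disease_code_3 = c("H250", NA, NA, NA, NA))

targets   <- c("I802", "G200", "H356")
code_cols <- startsWith(colnames(df), "Disease_code_")

# One logical column per code column: TRUE where the value is a target code
hits <- sapply(df[code_cols], function(x) x %in% targets)

# A row is flagged if at least one of its code columns matched
df$diagnosis <- as.integer(rowSums(hits) > 0)

sum(df$diagnosis)
```

The work is done once per column rather than once per row, which tends to matter more as the number of participants grows.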

Answer 2

Score: 3

You can use dplyr::if_any() with an assist from tidyr::replace_na(). As suggested by @jared_mamrot, you can use + to coerce the resulting logical vector to integer.

library(dplyr)
library(tidyr)

data %>%
  mutate(diagnosis = replace_na(
    +if_any(starts_with("disease_code"), \(x) x %in% c("I802", "G200", "H356")),
    0
  ))

Alternatively, omit the + and keep the result as a logical vector. You'll still be able to do arithmetic on it, and you can use it directly in logical tests (e.g., ifelse(diagnosis, ...) rather than ifelse(diagnosis == 1, ...)).
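
To make that last point concrete, here is a small base-R sketch. The vector `diagnosis` and the labels "case" and "control" are made up for illustration. It shows a logical flag being summed, averaged, and used directly as a condition.

```r
diagnosis <- c(TRUE, TRUE, TRUE, FALSE, TRUE)

# Arithmetic treats TRUE as 1 and FALSE as 0
n_flagged <- sum(diagnosis)   # number of flagged rows
prop      <- mean(diagnosis)  # proportion of flagged rows

# The logical vector is already a condition, so no `== 1` is needed
label <- ifelse(diagnosis, "case", "control")

print(n_flagged)  # 4
print(label)
```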

Answer 3

Score: 1

The way I would approach this is to convert your original data frame into a long format and then select the rows from there:

library(tidyr)
library(dplyr)

df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
                Disease_code_1 = c('I802', 'G200','I802',NA, 'H356'),
                Disease_code_2 = c('A071',NA,'G20',NA,'I802'),
                Disease_code_3 = c('H250', NA,NA,NA,NA),
                Date_of_diagnosis_1 = c('12/06/1997','13/06/1997','14/02/2003',NA,'18/03/2005'),
                Date_of_diagnosis_2 = c('12/06/1998',NA,'18/09/2001',NA,'12/07/1993'),
                Date_of_diagnosis_3 = c('17/09/2010',NA,NA,NA,NA))

# convert to long format, with all disease codes in one column
# (values_to is dropped: it is ignored when names_to contains ".value")
dflong <- pivot_longer(df, -ID, names_sep = "_(?=\\d)", names_to = c(".value", "count"))

# find the rows that match the code(s) of interest
hascondition <- which(dflong$Disease_code %in% c("I802", "G200", "H356"))

# get the unique patient IDs
patients <- unique(dflong$ID[hascondition])
length(patients)
#[1] 4

# filter the original data frame (named df_matched to avoid masking base::subset)
df_matched <- df %>% filter(ID %in% patients)

Answer 4

Score: 0

In data.table:

library(data.table)

setDT(df)[ , diagnosis := +any(.SD %in% c("I802", "G200", "H356")),
              .(ID), ## or seq_len(nrow(df))
                .SDcols = names(df)[grep("Disease_code", names(df))]][]
#>      ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1
#> 1: 1001           I802           A071           H250          12/06/1997
#> 2: 1002           G200           <NA>           <NA>          13/06/1997
#> 3: 1003           I802            G20           <NA>          14/02/2003
#> 4: 1004           <NA>           <NA>           <NA>                <NA>
#> 5: 1005           H356           I802           <NA>          18/03/2005
#>    Date_of_diagnosis_2 Date_of_diagnosis_3 diagnosis
#> 1:          12/06/1998          17/09/2010         1
#> 2:                <NA>                <NA>         1
#> 3:          18/09/2001                <NA>         1
#> 4:                <NA>                <NA>         0
#> 5:          12/07/1993                <NA>         1

huangapple
  • Published on May 25, 2023, 10:37:20
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76328545.html