2023-05-25 10:37:20

Add a column indicating whether at least one matching value exists across multiple columns

Question

I am attempting to select a number of ICD-10 medical codes from a diagnosis variable to retrieve a set of specific diagnoses within a dataset, but I am having a very challenging time doing so in a succinct manner.

For context, there are over 50 different diagnostic instances for every participant ID (PID), each coded as Disease_code_1, Disease_code_2, Disease_code_3, etc., with corresponding dates as shown below.

df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
                Disease_code_1 = c('I802', 'G200', 'I802', NA, 'H356'),
                Disease_code_2 = c('A071', NA, 'G20', NA, 'I802'),
                Disease_code_3 = c('H250', NA, NA, NA, NA),
                Date_of_diagnosis_1 = c('12/06/1997', '13/06/1997', '14/02/2003', NA, '18/03/2005'),
                Date_of_diagnosis_2 = c('12/06/1998', NA, '18/09/2001', NA, '12/07/1993'),
                Date_of_diagnosis_3 = c('17/09/2010', NA, NA, NA, NA))

    ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1 Date_of_diagnosis_2 Date_of_diagnosis_3
1 1001           I802           A071           H250          12/06/1997          12/06/1998          17/09/2010
2 1002           G200           <NA>           <NA>          13/06/1997                <NA>                <NA>
3 1003           I802            G20           <NA>          14/02/2003          18/09/2001                <NA>
4 1004           <NA>           <NA>           <NA>                <NA>                <NA>                <NA>
5 1005           H356           I802           <NA>          18/03/2005          12/07/1993                <NA>

As can be seen, I have several I802 codes within Disease_code_1, but also within Disease_code_2, that I want to include in my analysis.

However, when I combine the codes I want to select using a simple ifelse function, I cannot use an 'or' operator, as I get 0 participants:

data$diagnosis <- with(data, ifelse((data$disease_code_1 == "I802||G200||H356"),1,0))
data$diagnosis[is.na(data$diagnosis)] <- 0
sum(data$diagnosis)

I am left with a sum of 0. The only way I can achieve a number so far is using a long format. Although there are only 3 instances and 3 disease codes in this code, there are over 50 instances and approximately 12 diagnoses that need to be counted for each, meaning this line of code becomes exceptionally long and I would like to make it more succinct.

data$diagnosis <- with(data, ifelse(((data$disease_code_1 == "I802")|(data$disease_code_1 == "G200")|(data$disease_code_1 == "H356")|(data$disease_code_2 == "I802")|(data$disease_code_2 == "G200")|(data$disease_code_2 == "H356")|(data$disease_code_3 == "I802")|(data$disease_code_3 == "G200")|(data$disease_code_3 == "H356")), 1, 0))
data$diagnosis[is.na(data$diagnosis)] <- 0
sum(data$diagnosis)

This method will provide an answer; however, once the line reaches a certain length it ceases to run, resulting in a + symbol within the command line.

Is there a way I can make this much more succinct?

Answer 1

Score: 3

A few notes regarding your code:

When you use with, you no longer need the data$ syntax; with(data, ifelse(disease_code_1 == "I802||G200||H356", 1, 0)) is enough.
We cannot use == to compare against a set of strings.
In your code you have lower-case disease_code, but in your data you have upper-case Disease_code.
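
To make the second note concrete, here is a minimal sketch using a small hypothetical vector `codes` (not the asker's data). It shows that `==` compares each element against the whole literal string, while `%in%` tests membership in a set of codes:

```r
# Hypothetical vector standing in for a single Disease_code column
codes <- c("I802", "G200", NA, "A071")

# `==` compares every element with the literal "I802||G200||H356",
# so no element matches, and NA stays NA
eq_result <- codes == "I802||G200||H356"
print(eq_result)

# `%in%` checks membership in the set and returns FALSE for NA
in_result <- codes %in% c("I802", "G200", "H356")
print(in_result)
```

Because `%in%` returns FALSE rather than NA for missing values, the separate step that overwrites NA with 0 should no longer be necessary.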

Solution

To check the existence of a set of strings, we can use the %in% operator, and we can use apply to iterate over the rows of the data frame.

data$diagnosis <- apply(data[startsWith(colnames(data), "Disease_code_")], 1, \(x) as.integer(any(x %in% c("I802", "G200", "H356"))))

    ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1 Date_of_diagnosis_2 Date_of_diagnosis_3 diagnosis
1 1001           I802           A071           H250          12/06/1997          12/06/1998          17/09/2010         1
2 1002           G200           <NA>           <NA>          13/06/1997                <NA>                <NA>         1
3 1003           I802            G20           <NA>          14/02/2003          18/09/2001                <NA>         1
4 1004           <NA>           <NA>           <NA>                <NA>                <NA>                <NA>         0
5 1005           H356           I802           <NA>          18/03/2005          12/07/1993                <NA>         1

sum(data$diagnosis)
[1] 4
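
With 50+ code columns and many participants, calling a function once per row via apply may become slow. The following column-wise variant is only a sketch and not part of the original answer; `target_codes`, `code_cols`, and `hits` are names introduced here for illustration, and the data frame is a reduced copy of the question's example.

```r
# Reduced copy of the example data (date columns omitted for brevity)
df <- data.frame(
  ID = c(1001, 1002, 1003, 1004, 1005),
  Disease_code_1 = c("I802", "G200", "I802", NA, "H356"),
  Disease_code_2 = c("A071", NA, "G20", NA, "I802"),
  Disease_code_3 = c("H250", NA, NA, NA, NA)
)

# Codes of interest and a logical index of the disease-code columns
target_codes <- c("I802", "G200", "H356")
code_cols <- startsWith(names(df), "Disease_code_")

# One logical column per disease-code column: TRUE where the code matches
hits <- sapply(df[code_cols], function(x) x %in% target_codes)

# A participant is flagged if at least one column matched
df$diagnosis <- as.integer(rowSums(hits) > 0)
print(df$diagnosis)
print(sum(df$diagnosis))
```

This assumes the data frame has more than one row, so that sapply returns a matrix that rowSums can handle.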

Answer 2

Score: 3

You can use dplyr::if_any() with an assist from tidyr::replace_na(). As suggested by @jared_mamrot, you can use + to coerce the resulting logical vector to integer.

library(dplyr)
library(tidyr)
data %>%
  mutate(diagnosis = replace_na(
    +if_any(starts_with("disease_code"), \(x) x %in% c("I802", "G200", "H356")),
    0
  ))

Alternatively, omit the + and just keep as a logical vector. You’ll still be able to do arithmetic operations, plus use it directly in logical tests (i.e., ifelse(diagnosis, ...) rather than ifelse(diagnosis == 1, ...)).
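
As a small base R illustration of that last point, here is a sketch using a hypothetical logical vector `diagnosis` that mirrors the flags for the five example participants:

```r
# Hypothetical logical flag, as produced when the leading `+` is omitted
diagnosis <- c(TRUE, TRUE, TRUE, FALSE, TRUE)

# Arithmetic treats TRUE as 1 and FALSE as 0
n_cases <- sum(diagnosis)
prop_cases <- mean(diagnosis)

# The logical vector can be used directly as the test in ifelse()
label <- ifelse(diagnosis, "case", "control")

print(n_cases)
print(prop_cases)
print(label)
```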

Answer 3

Score: 1

The way I would approach this is to convert your original data frame into a long format and then select the rows from there:

library(tidyr)
library(dplyr)
df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
                Disease_code_1 = c('I802', 'G200', 'I802', NA, 'H356'),
                Disease_code_2 = c('A071', NA, 'G20', NA, 'I802'),
                Disease_code_3 = c('H250', NA, NA, NA, NA),
                Date_of_diagnosis_1 = c('12/06/1997', '13/06/1997', '14/02/2003', NA, '18/03/2005'),
                Date_of_diagnosis_2 = c('12/06/1998', NA, '18/09/2001', NA, '12/07/1993'),
                Date_of_diagnosis_3 = c('17/09/2010', NA, NA, NA, NA))
# convert to long format, all disease codes in one column
dflong <- pivot_longer(df, -ID, names_sep = "_(?=\\d)", names_to = c(".value", "count"), values_to = "value")
# find the rows that match the code(s) of interest
hascondition <- which(dflong$Disease_code %in% c("I802", "G200", "H356"))
# get a list of patient IDs
patients <- unique(dflong$ID[hascondition])
length(patients)
#[1] 4
# filter/subset the original data frame
subset <- df %>% filter(ID %in% patients)

Answer 4

Score: 0

In data.table:

library(data.table)
setDT(df)[ , diagnosis := +any(.SD %in% c("I802", "G200", "H356")),
              .(ID), ## or seq_len(NROW(df))
                .SDcols = names(df)[grep("Disease_code", names(df))]][]

#>      ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1
#> 1: 1001           I802           A071           H250          12/06/1997
#> 2: 1002           G200           <NA>           <NA>          13/06/1997
#> 3: 1003           I802            G20           <NA>          14/02/2003
#> 4: 1004           <NA>           <NA>           <NA>                <NA>
#> 5: 1005           H356           I802           <NA>          18/03/2005
#>    Date_of_diagnosis_2 Date_of_diagnosis_3 diagnosis
#> 1:          12/06/1998          17/09/2010         1
#> 2:                <NA>                <NA>         1
#> 3:          18/09/2001                <NA>         1
#> 4:                <NA>                <NA>         0
#> 5:          12/07/1993                <NA>         1
