Is there a more concise way to subset and extract the earliest diagnoses and codes in my R dataset?

Question

I have a dataset in which I am trying to determine each participant's earliest disease diagnosis, along with the code associated with that diagnosis. This is a much shorter version of the file I am working with.

Unfortunately, the first disease code is not always the earliest diagnosis, as ID 1005 shows:

df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
                Disease_code_1 = c('I802', 'G200','I802',NA, 'H356'),
                Disease_code_2 = c('A071',NA,'G20',NA,'I802'),
                Disease_code_3 = c('H250', NA,NA,NA,NA),
                Date_of_diagnosis_1 = c('12/06/1997','13/06/1997','14/02/2003',NA,'18/03/2005'),
                Date_of_diagnosis_2 = c('12/06/1998',NA,'18/09/2001',NA,'12/07/1993'),
                Date_of_diagnosis_3 = c('17/09/2010',NA,NA,NA,NA))

    ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1 Date_of_diagnosis_2 Date_of_diagnosis_3
1 1001           I802           A071           H250          12/06/1997          12/06/1998          17/09/2010
2 1002           G200           <NA>           <NA>          13/06/1997                <NA>                <NA>
3 1003           I802            G20           <NA>          14/02/2003          18/09/2001                <NA>
4 1004           <NA>           <NA>           <NA>                <NA>                <NA>                <NA>
5 1005           H356           I802           <NA>          18/03/2005          12/07/1993                <NA>

I tried creating a separate subset for each code and date pair as shown below, row-binding them, and then keeping only the earliest diagnosis. However, this becomes quite lengthy once I include all of the other covariates and variables I need.

Disease_1 <- as.data.frame(cbind(df$ID, df$Disease_code_1, df$Date_of_diagnosis_1))
Disease_2 <- as.data.frame(cbind(df$ID, df$Disease_code_2, df$Date_of_diagnosis_2))
Disease_3 <- as.data.frame(cbind(df$ID, df$Disease_code_3, df$Date_of_diagnosis_3))

Disease_data <- rbind(Disease_1, Disease_2, Disease_3)
colnames(Disease_data) = c("id","Disease_code","Date_of_diagnosis")

# Edit Disease_data to include each participant only once, based on earliest diagnosis
Disease_data <- Disease_data[order(Disease_data[,'id'],Disease_data[,'Date_of_diagnosis']),]
Disease_data <- Disease_data[!duplicated(Disease_data$id),]

This is a simplified version. In practice I would have over 25 of these Disease_ data frames, each with approximately 100 variables. It works, but it is very clunky, and I would like to make it more succinct if possible.

I understand that the step that keeps only each participant's earliest diagnosis is already succinct; it is the set-up for this method that is long. Is there a way to use the startswith command that might work? I have tried, but with no success.

Answer 1

Score: 3

Your data is hard to work with because it's in wide form. Convert it to long form and the problem becomes much easier to solve.

Using pivot_longer() from the tidyr package (part of the tidyverse), we take all the columns other than ID and pivot them into name-value pairs, each on its own row. Since the example variable names have the pattern [TYPE OF VALUE]_[SEQUENTIAL NUMBER], we can use a regular expression in the names_pattern argument to split each column name into a name column, which holds the type of measure (here either "Disease_code" or "Date_of_diagnosis"), and a num column, which holds the sequential number.
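To see what that regular expression does to the column names on their own, here is a minimal base R sketch. It uses sub() only as a stand-in to expose the two capture groups; pivot_longer() does the equivalent split internally, and the parts data frame is purely illustrative.

```r
# Three of the column names from the example data (ID is excluded from the pivot).
cols <- c("Disease_code_1", "Disease_code_2", "Date_of_diagnosis_1")

# Same pattern as passed to names_pattern: greedy prefix, underscore, trailing digits.
pattern <- "(.*)_([0-9]*$)"

parts <- data.frame(
  column = cols,
  name   = sub(pattern, "\\1", cols),  # first capture group: type of measure
  num    = sub(pattern, "\\2", cols)   # second capture group: sequential number
)
parts
```

Because the first group is greedy, the split happens at the last underscore, so "Date_of_diagnosis_1" becomes "Date_of_diagnosis" and "1" rather than being cut at the first underscore.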

Then we pivot it wider to make single Disease_code and Date_of_diagnosis columns, with one row for each ID and number combination. Finally, we convert Date_of_diagnosis to a Date (using lubridate::dmy, since it's in day/month/year format) so that it sorts properly, and filter out the rows where Disease_code is NA, meaning there was no value there:

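library(tidyverse)  # attaches tidyr, dplyr and (since tidyverse 2.0.0) lubridate

df2 <- df %>%
    # One row per ID and original column; split each column name into type and number
    pivot_longer(-ID,
                 names_pattern = "(.*)_([0-9]*$)",
                 names_to = c('name', 'num')) %>%
    # Spread the types back out into Disease_code and Date_of_diagnosis columns
    pivot_wider() %>%
    # Parse the day/month/year strings into Date values
    mutate(Date_of_diagnosis = dmy(Date_of_diagnosis)) %>%
    # Drop slots with no recorded code
    filter(!is.na(Disease_code))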

df2
# A tibble: 8 × 4
     ID num   Disease_code Date_of_diagnosis
  <dbl> <chr> <chr>        <date>
1  1001 1     I802         1997-06-12
2  1001 2     A071         1998-06-12
3  1001 3     H250         2010-09-17
4  1002 1     G200         1997-06-13
5  1003 1     I802         2003-02-14
6  1003 2     G20          2001-09-18
7  1005 1     H356         2005-03-18
8  1005 2     I802         1993-07-12
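The date conversion matters because day/month/year strings do not sort chronologically. A minimal base R sketch using the two dates for ID 1003, with as.Date() standing in for lubridate::dmy:

```r
# The two diagnosis dates recorded for ID 1003, still as text.
dates_chr <- c("14/02/2003", "18/09/2001")

# Sorting text compares characters left to right, so the day decides the order.
sort(dates_chr)

# Parsed as day/month/year, the values become real dates and sort chronologically.
dates <- as.Date(dates_chr, format = "%d/%m/%Y")
min(dates)
```

As text, "14/02/2003" sorts before "18/09/2001", although the 2001 diagnosis is the earlier one. Any method that orders the raw strings will therefore pick the wrong row for some IDs.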

Now that the data is properly formatted, we can just group by ID and pick the earliest date (minimum value) for each:

df2 %>%
    group_by(ID) %>%
    slice_min(Date_of_diagnosis)

     ID num   Disease_code Date_of_diagnosis
  <dbl> <chr> <chr>        <date>
1  1001 1     I802         1997-06-12
2  1002 1     G200         1997-06-13
3  1003 2     G20          2001-09-18
4  1005 2     I802         1993-07-12
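Two details of this result may matter in practice: ID 1004 has no diagnosis and drops out, and slice_min() keeps every tied row by default. If one row per participant is needed, including participants without a diagnosis, something along these lines might work. It is a sketch, not part of the answer above; the names long, all_ids, earliest and result are illustrative.

```r
library(dplyr)

# A small long-format table in the same shape as df2.
long <- data.frame(
  ID = c(1001, 1001, 1003, 1003, 1005, 1005),
  num = c("1", "2", "1", "2", "1", "2"),
  Disease_code = c("I802", "A071", "I802", "G20", "H356", "I802"),
  Date_of_diagnosis = as.Date(c("1997-06-12", "1998-06-12", "2003-02-14",
                                "2001-09-18", "2005-03-18", "1993-07-12"))
)

# Every participant of interest, including 1004, who has no diagnosis.
all_ids <- data.frame(ID = c(1001, 1003, 1004, 1005))

earliest <- long %>%
  group_by(ID) %>%
  # with_ties = FALSE returns exactly one row per ID, even when two
  # diagnoses share the same earliest date.
  slice_min(Date_of_diagnosis, n = 1, with_ties = FALSE) %>%
  ungroup()

# left_join keeps every ID; participants without a diagnosis get NA.
result <- left_join(all_ids, earliest, by = "ID")
result
```

Whether dropping ties is appropriate depends on the analysis. If two codes recorded on the same day are both meaningful, leaving with_ties at its default keeps them.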

Answer 2

Score: 2

A similar approach to @divibisan's, but more concise. With the .value sentinel in names_to, a single pivot_longer is enough.

library(tidyverse)

# The by and na_rm arguments of slice_min() require dplyr 1.1.0 or later
df |> 
  pivot_longer(-ID, names_pattern = "(.*)_(\\d+)", names_to = c(".value", "group")) |> 
  slice_min(dmy(Date_of_diagnosis), by = "ID", na_rm = TRUE)

# A tibble: 4 × 4
     ID group Disease_code Date_of_diagnosis
  <dbl> <chr> <chr>        <chr>
1  1001 1     I802         12/06/1997
2  1002 1     G200         13/06/1997
3  1003 2     G20          18/09/2001
4  1005 2     I802         12/07/1993
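With many other covariates in the real data, pivoting everything except ID would sweep those columns in as well. One possible variation, sketched here under assumptions, restricts the pivot with tidyselect's starts_with() so that other columns are carried along unchanged. The Age column is hypothetical and stands in for the covariates mentioned in the question; the prefixes are taken from the example column names.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Example data plus one extra covariate (Age is hypothetical, not in the question).
df <- data.frame(
  ID = c(1001, 1002, 1003, 1004, 1005),
  Age = c(54, 61, 47, 39, 70),
  Disease_code_1 = c("I802", "G200", "I802", NA, "H356"),
  Disease_code_2 = c("A071", NA, "G20", NA, "I802"),
  Date_of_diagnosis_1 = c("12/06/1997", "13/06/1997", "14/02/2003", NA, "18/03/2005"),
  Date_of_diagnosis_2 = c("12/06/1998", NA, "18/09/2001", NA, "12/07/1993")
)

earliest <- df %>%
  pivot_longer(
    # Only the code/date columns are pivoted; ID and Age are left as they are.
    cols = starts_with(c("Disease_code_", "Date_of_diagnosis_")),
    names_pattern = "(.*)_(\\d+)",
    names_to = c(".value", "group")
  ) %>%
  # Store the parsed date so the result holds Date values rather than text.
  mutate(Date_of_diagnosis = dmy(Date_of_diagnosis)) %>%
  filter(!is.na(Date_of_diagnosis)) %>%
  group_by(ID) %>%
  slice_min(Date_of_diagnosis, n = 1, with_ties = FALSE) %>%
  ungroup()

earliest
```

This keeps one row per ID that has at least one dated diagnosis, with the covariates alongside the earliest code and date.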

huangapple
  • Published on June 1, 2023, 10:09:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76378266.html