Is there a more concise way to subset and extract the earliest diagnoses and codes in my R dataset?

Question

I have a dataset in which I am trying to determine the earliest diagnosis of a disease, as well as the code associated with that diagnosis. This is a much shorter version of the file I am working with.

Unfortunately, the first disease code is not always the earliest diagnosis, as can be seen for ID 1005:

    df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
                    Disease_code_1 = c('I802', 'G200', 'I802', NA, 'H356'),
                    Disease_code_2 = c('A071', NA, 'G20', NA, 'I802'),
                    Disease_code_3 = c('H250', NA, NA, NA, NA),
                    Date_of_diagnosis_1 = c('12/06/1997', '13/06/1997', '14/02/2003', NA, '18/03/2005'),
                    Date_of_diagnosis_2 = c('12/06/1998', NA, '18/09/2001', NA, '12/07/1993'),
                    Date_of_diagnosis_3 = c('17/09/2010', NA, NA, NA, NA))

        ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1 Date_of_diagnosis_2 Date_of_diagnosis_3
    1 1001           I802           A071           H250          12/06/1997          12/06/1998          17/09/2010
    2 1002           G200           <NA>           <NA>          13/06/1997                <NA>                <NA>
    3 1003           I802            G20           <NA>          14/02/2003          18/09/2001                <NA>
    4 1004           <NA>           <NA>           <NA>                <NA>                <NA>                <NA>
    5 1005           H356           I802           <NA>          18/03/2005          12/07/1993                <NA>

I have tried creating a separate subset for each code and date pair as shown below, row-binding them, and then keeping only the earliest diagnosis per participant. However, this becomes quite lengthy once I include all of the other covariates and variables that I need.

    Disease_1 <- as.data.frame((cbind(df$ID, df$Disease_code_1, df$Date_of_diagnosis_1)))
    Disease_2 <- as.data.frame((cbind(df$ID, df$Disease_code_2, df$Date_of_diagnosis_2)))
    Disease_3 <- as.data.frame((cbind(df$ID, df$Disease_code_3, df$Date_of_diagnosis_3)))
    Disease_data <- rbind(Disease_1, Disease_2, Disease_3)
    colnames(Disease_data) = c("id", "Disease_code", "Date_of_diagnosis")
    # Edit Disease_data to include each participant only once, based on earliest diagnosis
    # (note: the dates are still character strings here, so order() sorts them as text)
    Disease_data <- Disease_data[order(Disease_data[, 'id'], Disease_data[, 'Date_of_diagnosis']), ]
    Disease_data <- Disease_data[!duplicated(Disease_data$id), ]

This is a simplified version. In practice I would have over 25 iterations of the Disease_ data frames, each with approximately 100 variables. This works, but it is very clunky, and if possible I would like to make it more succinct.

I understand that the step that keeps only each participant's earliest diagnosis is already very succinct; it is the set-up for this method that is lengthy. Is there a way to use the startswith command that may work? I have attempted it but with no success.

Answer 1

Score: 3

Your data is hard to work with here because it's in wide form. Convert it to long form and the problem becomes much easier to solve:

Using the tidyr package from the tidyverse, we take all the columns other than ID and pivot them into name-value pairs, one per row. Since the example variable names follow the pattern [TYPE OF VALUE]_[SEQUENTIAL NUMBER], we can use a regular expression in the names_pattern argument to split each variable name into a name column holding the type of measure (here either "Disease_code" or "Date_of_diagnosis") and a num column holding the sequential number.

Then we pivot it wider to make single Disease_code and Date_of_diagnosis columns, with one row for each ID and number combination. Finally, we convert Date_of_diagnosis to a Date (using lubridate::dmy, since it's in day/month/year format) so that it sorts properly, and then filter out the rows where Disease_code is NA, meaning there was no value there:

    library(tidyverse)

    df2 <- df %>%
      pivot_longer(-ID,
                   names_pattern = "(.*)_([0-9]*$)",
                   names_to = c('name', 'num')) %>%
      pivot_wider() %>%
      mutate(Date_of_diagnosis = dmy(Date_of_diagnosis)) %>%
      filter(!is.na(Disease_code))

    df2

    # A tibble: 8 × 4
         ID num   Disease_code Date_of_diagnosis
      <dbl> <chr> <chr>        <date>
    1  1001 1     I802         1997-06-12
    2  1001 2     A071         1998-06-12
    3  1001 3     H250         2010-09-17
    4  1002 1     G200         1997-06-13
    5  1003 1     I802         2003-02-14
    6  1003 2     G20          2001-09-18
    7  1005 1     H356         2005-03-18
    8  1005 2     I802         1993-07-12
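
To make the names_pattern step more concrete, here is a minimal base R sketch of how that regular expression splits a column name into its two capture groups. It only illustrates the regex; the vector cols and the use of sub() are assumptions made for the demonstration, not what pivot_longer does internally.

```r
# A few column names from the example data (ID is excluded from the pivot)
cols <- c("Disease_code_1", "Disease_code_2", "Date_of_diagnosis_3")

# The same pattern that is passed to names_pattern
pattern <- "(.*)_([0-9]*$)"

# sub() with a backreference keeps only the requested capture group
name <- sub(pattern, "\\1", cols)  # everything before the final underscore
num  <- sub(pattern, "\\2", cols)  # the trailing digits

data.frame(cols, name, num)
```

Because the first group is greedy, it swallows the earlier underscores, so "Date_of_diagnosis_3" splits at the last underscore only.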

Now that the data is properly formatted, we can just group by ID and pick the earliest date (minimum value) for each:

    df2 %>%
      group_by(ID) %>%
      slice_min(Date_of_diagnosis)

         ID num   Disease_code Date_of_diagnosis
      <dbl> <chr> <chr>        <date>
    1  1001 1     I802         1997-06-12
    2  1002 1     G200         1997-06-13
    3  1003 2     G20          2001-09-18
    4  1005 2     I802         1993-07-12
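
One thing that may matter on the full dataset: slice_min keeps tied rows by default, so a participant with two diagnoses on the same earliest date would appear twice. The sketch below uses a small hypothetical long-format table (the object long and its values are made up for illustration) to show the difference that with_ties = FALSE makes.

```r
library(dplyr)

# Hypothetical long-format data: ID 1 has two diagnoses on the same earliest date
long <- tibble(
  ID = c(1, 1, 2),
  Disease_code = c("I802", "A071", "G200"),
  Date_of_diagnosis = as.Date(c("1997-06-12", "1997-06-12", "1997-06-13"))
)

# Default behaviour (with_ties = TRUE): both tied rows for ID 1 are kept
with_ties <- long %>%
  group_by(ID) %>%
  slice_min(Date_of_diagnosis) %>%
  ungroup()

# with_ties = FALSE: at most one row per ID (the first of the tied rows)
one_per_id <- long %>%
  group_by(ID) %>%
  slice_min(Date_of_diagnosis, with_ties = FALSE) %>%
  ungroup()
```

Whether ties should be kept or collapsed depends on the analysis, so this is a choice to make deliberately and not a correction to the answer above.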

Answer 2

Score: 2

A similar approach to @divibisan's, but more concise. You can use the .value helper in names_to so that a single pivot_longer is enough.

    library(tidyverse)

    df |>
      pivot_longer(-ID, names_pattern = "(.*)_(\\d+)", names_to = c(".value", "group")) |>
      slice_min(dmy(Date_of_diagnosis), by = "ID", na_rm = TRUE)

    # A tibble: 4 × 4
         ID group Disease_code Date_of_diagnosis
      <dbl> <chr> <chr>        <chr>
    1  1001 1     I802         12/06/1997
    2  1002 1     G200         13/06/1997
    3  1003 2     G20          18/09/2001
    4  1005 2     I802         12/07/1993
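
As a side note on the startswith idea raised in the question, the following is a rough base R sketch that selects the code and date columns by prefix with startsWith() and then picks the earliest diagnosis per row without reshaping. It is not taken from either answer; the helper objects (code_cols, date_cols, dates, first, idx, earliest) are names introduced here for illustration, and it assumes the code and date columns appear in matching order.

```r
df <- data.frame(
  ID = c(1001, 1002, 1003, 1004, 1005),
  Disease_code_1 = c("I802", "G200", "I802", NA, "H356"),
  Disease_code_2 = c("A071", NA, "G20", NA, "I802"),
  Disease_code_3 = c("H250", NA, NA, NA, NA),
  Date_of_diagnosis_1 = c("12/06/1997", "13/06/1997", "14/02/2003", NA, "18/03/2005"),
  Date_of_diagnosis_2 = c("12/06/1998", NA, "18/09/2001", NA, "12/07/1993"),
  Date_of_diagnosis_3 = c("17/09/2010", NA, NA, NA, NA)
)

# Select columns by prefix with base R's startsWith()
code_cols <- names(df)[startsWith(names(df), "Disease_code_")]
date_cols <- names(df)[startsWith(names(df), "Date_of_diagnosis_")]

# Parse each date column (day/month/year) into days since 1970-01-01
dates <- sapply(df[date_cols], function(x) as.numeric(as.Date(x, format = "%d/%m/%Y")))

# Column position of the earliest date in each row; NA when a row has no dates
first <- apply(dates, 1, function(r) if (all(is.na(r))) NA_integer_ else which.min(r))

# Matrix indexing pulls the matching code and date for every row
idx <- cbind(seq_len(nrow(df)), first)
earliest <- data.frame(
  ID = df$ID,
  Disease_code = as.matrix(df[code_cols])[idx],
  Date_of_diagnosis = as.Date(dates[idx], origin = "1970-01-01")
)

# Drop participants with no diagnosis at all (ID 1004 here)
earliest <- earliest[!is.na(earliest$Disease_code), ]
```

This keeps the data wide, which may be convenient when many other covariates have to stay attached to each row, but the pivot_longer approaches above are shorter and easier to read.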

huangapple
  • Posted on June 1, 2023, 10:09:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76378266.html