Is there a more concise way to subset and extract the earliest diagnoses and codes in my R dataset?
Question
I have a dataset where I am attempting to determine the earliest diagnosis of disease, as well as the code associated with that diagnosis. This is a much shorter version of the file I am working with.
Unfortunately, the first disease code is not always the earliest diagnosis, as can be seen for ID 1005:
df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
Disease_code_1 = c('I802', 'G200','I802',NA, 'H356'),
Disease_code_2 = c('A071',NA,'G20',NA,'I802'),
Disease_code_3 = c('H250', NA,NA,NA,NA),
Date_of_diagnosis_1 = c('12/06/1997','13/06/1997','14/02/2003',NA,'18/03/2005'),
Date_of_diagnosis_2 = c('12/06/1998',NA,'18/09/2001',NA,'12/07/1993'),
Date_of_diagnosis_3 = c('17/09/2010',NA,NA,NA,NA))
ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1 Date_of_diagnosis_2 Date_of_diagnosis_3
1 1001 I802 A071 H250 12/06/1997 12/06/1998 17/09/2010
2 1002 G200 <NA> <NA> 13/06/1997 <NA> <NA>
3 1003 I802 G20 <NA> 14/02/2003 18/09/2001 <NA>
4 1004 <NA> <NA> <NA> <NA> <NA> <NA>
5 1005 H356 I802 <NA> 18/03/2005 12/07/1993 <NA>
I have attempted to create a separate subset for each code and date pair as shown below, row-bind them, and then keep only the earliest diagnosis. However, this becomes quite lengthy once all the other covariates and variables I need are included.
Disease_1 <- as.data.frame((cbind(df$ID, df$Disease_code_1, df$Date_of_diagnosis_1)))
Disease_2 <- as.data.frame((cbind(df$ID, df$Disease_code_2, df$Date_of_diagnosis_2)))
Disease_3 <- as.data.frame((cbind(df$ID, df$Disease_code_3, df$Date_of_diagnosis_3)))
Disease_data <- rbind(Disease_1, Disease_2, Disease_3)
colnames(Disease_data) = c("id","Disease_code","Date_of_diagnosis")
# Edit Disease_data to include each participant only once, based on earliest diagnosis
Disease_data <- Disease_data[order(Disease_data[, 'id'], Disease_data[, 'Date_of_diagnosis']), ]
Disease_data <- Disease_data[!duplicated(Disease_data$id), ]
This is a simplified version. In practice I would have over 25 of these Disease_ data frames, each with approximately 100 variables. It works, but it is very clunky, and if possible I would like to make it more succinct.
I understand that the final step, keeping only the earliest diagnosis for each participant, is already succinct; it is the set-up for this method that is lengthy. Is there a way to use the startswith command that may work? I have attempted this, but with no success.
Answer 1
Score: 3
Your data is hard to work with because it is in wide format. Convert it to long format and the problem becomes much easier to solve.
Using pivot_longer from the tidyr package (part of the tidyverse), we take all the columns other than ID and pivot them into name-value pairs, each on its own row. Since the example variable names follow the pattern [TYPE OF VALUE]_[SEQUENTIAL NUMBER], we can pass a regular expression to the names_pattern argument to split each column name into a name column, which holds the type of measure (here either "Disease_code" or "Date_of_diagnosis"), and a num column, which holds the sequential number.
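To make that first step concrete, here is a minimal sketch that stops after pivot_longer, so the intermediate long table can be inspected. It rebuilds the example data from the question; the object name long is only illustrative.

```r
library(tidyr)

# Same example data as in the question
df <- data.frame(
  ID = c(1001, 1002, 1003, 1004, 1005),
  Disease_code_1 = c("I802", "G200", "I802", NA, "H356"),
  Disease_code_2 = c("A071", NA, "G20", NA, "I802"),
  Disease_code_3 = c("H250", NA, NA, NA, NA),
  Date_of_diagnosis_1 = c("12/06/1997", "13/06/1997", "14/02/2003", NA, "18/03/2005"),
  Date_of_diagnosis_2 = c("12/06/1998", NA, "18/09/2001", NA, "12/07/1993"),
  Date_of_diagnosis_3 = c("17/09/2010", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Each non-ID column becomes one row per ID; the column name is split into
# the measure type ("name") and the trailing number ("num")
long <- pivot_longer(df, -ID,
                     names_pattern = "(.*)_([0-9]*$)",
                     names_to = c("name", "num"))

head(long)
```

The result has the columns ID, name, num and value, with 30 rows (5 IDs times 6 original columns).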
Then we pivot it wider to get a single Disease_code column and a single Date_of_diagnosis column, with one row for each combination of ID and number. Finally, we convert Date_of_diagnosis to a Date (using lubridate::dmy, since it is in day/month/year format) so that it sorts chronologically, and filter out the rows where Disease_code is NA, meaning there was no value there:
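library(tidyverse)
library(lubridate)  # for dmy(); attached automatically only from tidyverse 2.0.0 onward

df2 <- df %>%
  # one row per ID and original column; split names into type and number
  pivot_longer(-ID,
               names_pattern = "(.*)_([0-9]*$)",
               names_to = c('name', 'num')) %>%
  # spread the types back out into Disease_code and Date_of_diagnosis columns
  pivot_wider() %>%
  # parse day/month/year strings so they sort chronologically
  mutate(Date_of_diagnosis = dmy(Date_of_diagnosis)) %>%
  # drop slots with no recorded diagnosis
  filter(!is.na(Disease_code))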
df2
# A tibble: 8 × 4
ID num Disease_code Date_of_diagnosis
<dbl> <chr> <chr> <date>
1 1001 1 I802 1997-06-12
2 1001 2 A071 1998-06-12
3 1001 3 H250 2010-09-17
4 1002 1 G200 1997-06-13
5 1003 1 I802 2003-02-14
6 1003 2 G20 2001-09-18
7 1005 1 H356 2005-03-18
8 1005 2 I802 1993-07-12
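As a quick check of why the conversion to Date matters, the sketch below compares sorting the two dates recorded for ID 1003 as plain strings and as parsed dates. It uses base R's as.Date with an explicit format rather than lubridate::dmy, purely to keep the example dependency-free.

```r
# The two diagnosis dates recorded for ID 1003, as day/month/year strings
dates <- c("14/02/2003", "18/09/2001")

# Character sorting compares the day digits first, so 2003 comes before 2001
sort(dates)
# [1] "14/02/2003" "18/09/2001"

# Parsing to Date gives a chronological ordering
parsed <- as.Date(dates, format = "%d/%m/%Y")
sort(parsed)
# [1] "2001-09-18" "2003-02-14"
```

This also means that ordering the raw date strings, as in the question's own attempt, can pick the wrong row for some IDs.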
Now that the data is properly formatted, we can just group by ID and pick the earliest date (minimum value) for each:
df2 %>%
  group_by(ID) %>%
  slice_min(Date_of_diagnosis)
ID num Disease_code Date_of_diagnosis
<dbl> <chr> <chr> <date>
1 1001 1 I802 1997-06-12
2 1002 1 G200 1997-06-13
3 1003 2 G20 2001-09-18
4 1005 2 I802 1993-07-12
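One thing to be aware of: by default slice_min keeps every row that ties for the minimum, so an ID with two codes recorded on the same earliest date would appear twice. The sketch below uses made-up data (the IDs 2001 and 2002 are hypothetical and not part of the question) to show the with_ties argument, which restricts the result to one row per ID.

```r
library(dplyr)

# Hypothetical data: ID 2001 has two codes recorded on the same earliest date
ties <- tibble(
  ID = c(2001, 2001, 2002),
  Disease_code = c("I802", "A071", "G200"),
  Date_of_diagnosis = as.Date(c("1997-06-12", "1997-06-12", "2001-09-18"))
)

# Default behaviour: both tied rows for ID 2001 are kept
with_ties <- ties %>%
  group_by(ID) %>%
  slice_min(Date_of_diagnosis) %>%
  ungroup()

# with_ties = FALSE keeps only the first of the tied rows for each ID
one_per_id <- ties %>%
  group_by(ID) %>%
  slice_min(Date_of_diagnosis, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(with_ties)   # 3
nrow(one_per_id)  # 2
```

Whether a tie should be resolved this way or both codes kept depends on the analysis. Note also that ID 1004 does not appear in the result above, since it has no diagnoses and all of its rows were filtered out.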
Answer 2
Score: 2
A similar approach to @divibisan, but more concise. You can make use of the .value helper so that a single pivot_longer is enough.
library(tidyverse)

df |>
  pivot_longer(-ID, names_pattern = "(.*)_(\\d+)", names_to = c(".value", "group")) |>
  # by and na_rm in slice_min() need dplyr 1.1.0 or later
  slice_min(dmy(Date_of_diagnosis), by = "ID", na_rm = TRUE)
# A tibble: 4 × 4
ID group Disease_code Date_of_diagnosis
<dbl> <chr> <chr> <chr>
1 1001 1 I802 12/06/1997
2 1002 1 G200 13/06/1997
3 1003 2 G20 18/09/2001
4 1005 2 I802 12/07/1993
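Since the question asked about startswith, here is a rough base R sketch of the same idea that picks the column groups with startsWith and reshapes them with stats::reshape. It is an alternative under stated assumptions, not part of either answer above: it assumes the code and date columns appear in the same order in the data frame, and the names code_cols, date_cols, long and earliest are only illustrative.

```r
# Same example data as in the question
df <- data.frame(
  ID = c(1001, 1002, 1003, 1004, 1005),
  Disease_code_1 = c("I802", "G200", "I802", NA, "H356"),
  Disease_code_2 = c("A071", NA, "G20", NA, "I802"),
  Disease_code_3 = c("H250", NA, NA, NA, NA),
  Date_of_diagnosis_1 = c("12/06/1997", "13/06/1997", "14/02/2003", NA, "18/03/2005"),
  Date_of_diagnosis_2 = c("12/06/1998", NA, "18/09/2001", NA, "12/07/1993"),
  Date_of_diagnosis_3 = c("17/09/2010", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Pick the two groups of columns by prefix (assumed to be in matching order)
code_cols <- names(df)[startsWith(names(df), "Disease_code_")]
date_cols <- names(df)[startsWith(names(df), "Date_of_diagnosis_")]

# Stack each group into a single column, one row per ID and slot number
long <- reshape(df, direction = "long",
                varying = list(code_cols, date_cols),
                v.names = c("Disease_code", "Date_of_diagnosis"),
                timevar = "num", idvar = "ID")

# Parse the dates, drop empty slots, then keep the earliest row per ID
long$Date_of_diagnosis <- as.Date(long$Date_of_diagnosis, format = "%d/%m/%Y")
long <- long[!is.na(long$Date_of_diagnosis), ]
long <- long[order(long$ID, long$Date_of_diagnosis), ]
earliest <- long[!duplicated(long$ID), ]

earliest
```

Any additional covariate columns that are not listed in varying are carried along unchanged on every row, so they do not need to be bound in separately.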