Is there a way to collapse rows based on multiple column conditions, grouped by ID in R?
Question
I have a df of episodic antibiotic data (long form). For each patient, I’m looking to combine specific antibiotics into “combo” therapies based on 1) a specific list of antibiotics that would be of interest, and 2) ONLY if those specific antibiotics had the same start dates. The specific therapies of interest are:
- Ampicillin and ceftriaxone
- Ampicillin and gentamicin
- Ampicillin and streptomycin
I don’t need any other combinations (e.g. if ampicillin and daptomycin start on the same day, I don’t want that combined).
I have a patient ID by which the antibiotic episodes can be grouped. My start_date variable has already been coerced into date class (YYYY-MM-DD). All other variables are character.
Here’s a sample of the data. My actual data frame has over 1,000 rows.
```r
ID <- c("C1", "C1", "C1", "C2", "C3", "C3", "C3", "C4", "C4", "C5", "C5", "C5", "C6", "C6")
abx <- c("ampicillin", "ceftriaxone", "vancomycin", "linezolid", "ampicillin", "gentamicin", "vancomycin", "piptazo", "daptomycin", "ampicillin", "streptomycin", "vancomycin", "ampicillin", "daptomycin")
date <- as.Date(c("2020-02-01", "2020-02-01", "2020-01-28", "2020-03-01", "2020-02-17", "2020-02-17", "2020-03-01", "2020-01-11", "2020-01-05", "2020-01-05", "2020-01-05", "2020-01-01", "2020-01-14", "2020-01-14"))
df_abx <- data.frame(ID, abx, date)
```
This is the solution I’m hoping for:
ID | abx | start_date |
---|---|---|
C1 | ampicillin, ceftriaxone | 2020-02-01 |
C1 | vancomycin | 2020-01-28 |
C2 | linezolid | 2020-03-01 |
C3 | ampicillin, gentamicin | 2020-02-17 |
C3 | vancomycin | 2020-03-01 |
C4 | piptazo | 2020-01-11 |
C4 | daptomycin | 2020-01-05 |
C5 | ampicillin, streptomycin | 2020-01-05 |
C5 | vancomycin | 2020-01-01 |
C6 | ampicillin | 2020-01-14 |
C6 | daptomycin | 2020-01-14 |
I suspect the solution includes "group_by" for the patient ID, and I've been toying around with summarize and case_when but I haven't gotten to where I want to be.
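# Answer 1
**Score**: 2
**Update: the correct version.** We can do it this way. The answer's code uses `pattern` and `df1` without showing their definitions; the definitions below are assumptions that are consistent with the output shown.
```r
library(dplyr)
library(tidyr)
library(stringr)  # for str_detect()

# Assumption: `pattern` is not defined in the original answer.
pattern <- "ampicillin|ceftriaxone|gentamicin|streptomycin"

# Assumption: `df1` is not defined in the original answer. Here it holds the
# ID/date groups with at least two drugs of interest, collapsed to one row each.
df1 <- df_abx %>%
  group_by(ID, date) %>%
  mutate(x = ifelse(str_detect(abx, pattern), 1, 0)) %>%
  filter(sum(x) >= 2) %>%
  summarise(abx = toString(abx), .groups = "drop")

df_abx %>%
  group_by(ID, date) %>%
  mutate(abx = toString(abx)) %>%  # one string per ID/date group
  anti_join(df1, by = "abx") %>%   # drop the groups that form a combination
  separate_rows(abx) %>%           # split the remaining groups back into rows
  distinct() %>%
  bind_rows(df1) %>%               # add the combined rows back
  arrange(ID, abx)
```
Output:
```
   ID    abx                      date
   <chr> <chr>                    <date>
 1 C1    ampicillin, ceftriaxone  2020-02-01
 2 C1    vancomycin               2020-01-28
 3 C2    linezolid                2020-03-01
 4 C3    ampicillin, gentamicin   2020-02-17
 5 C3    vancomycin               2020-03-01
 6 C4    daptomycin               2020-01-05
 7 C4    piptazo                  2020-01-11
 8 C5    ampicillin, streptomycin 2020-01-05
 9 C5    vancomycin               2020-01-01
10 C6    ampicillin               2020-01-14
11 C6    daptomycin               2020-01-14
```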
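One thing worth checking with the `sum(x) >= 2` criterion: it counts any two of the four drugs, so a same-day group without ampicillin could also qualify. The base R sketch below illustrates this on a hypothetical group; the regex assigned to `pattern` is an assumption (the answer does not show it), and `partners`, `same_day`, `by_count`, and `by_pair` are names made up for the illustration.

```r
# Assumed regex; the answer uses `pattern` without showing its definition.
pattern  <- "ampicillin|ceftriaxone|gentamicin|streptomycin"
partners <- c("ceftriaxone", "gentamicin", "streptomycin")

# Hypothetical same-day group: two partner drugs, no ampicillin.
same_day <- c("ceftriaxone", "gentamicin")

# Criterion used in the answer: at least two pattern matches in the group.
by_count <- sum(grepl(pattern, same_day)) >= 2

# Stricter criterion: ampicillin plus at least one partner drug.
by_pair <- "ampicillin" %in% same_day && any(same_day %in% partners)

by_count  # TRUE: this group would be merged
by_pair   # FALSE: this group would be left alone
```

The sample data never has two partner drugs on the same day without ampicillin, so both criteria give the same result there.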
# Answer 2
**Score**: 1
We can group by `ID` and `date`, create a new grouping column with `case_match` that puts the selected drugs into one common group, add that column to the grouping, and then `summarise` by pasting (`str_c`) the values in `abx`.
```r
library(dplyr)    # case_match() requires dplyr >= 1.1.0
library(stringr)
df_abx %>%
  group_by(ID, date) %>%
  # selected drugs share the group 'g1'; every other drug keeps its own name
  mutate(abx_grp = coalesce(case_match(abx,
    c('ampicillin', 'ceftriaxone', 'gentamicin', 'streptomycin') ~ 'g1'), abx)) %>%
  group_by(abx_grp, .add = TRUE) %>%
  summarise(abx = str_c(abx, collapse = ", "), .groups = 'drop') %>%
  select(-abx_grp)
```
Output:
```
# A tibble: 11 × 3
   ID    date       abx
   <chr> <date>     <chr>
 1 C1    2020-01-28 vancomycin
 2 C1    2020-02-01 ampicillin, ceftriaxone
 3 C2    2020-03-01 linezolid
 4 C3    2020-02-17 ampicillin, gentamicin
 5 C3    2020-03-01 vancomycin
 6 C4    2020-01-05 daptomycin
 7 C4    2020-01-11 piptazo
 8 C5    2020-01-01 vancomycin
 9 C5    2020-01-05 ampicillin, streptomycin
10 C6    2020-01-14 daptomycin
11 C6    2020-01-14 ampicillin
```
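Because the `g1` group pools all four drugs, two partner drugs started on the same day would be merged even when ampicillin is absent. If that matters for the real data, a stricter rule can be sketched in base R as follows. This is only a sketch under stated assumptions: a combination is taken to mean ampicillin plus one or more of the three partner drugs on the same ID and date, and `combine_abx`, `anchor`, `partners`, and the `toy` data are hypothetical names and values introduced for illustration.

```r
# Sketch: merge ampicillin with its partner drugs only when they share an
# ID and date. All names here are illustrative, not from the answers.
combine_abx <- function(df, anchor = "ampicillin",
                        partners = c("ceftriaxone", "gentamicin", "streptomycin")) {
  groups <- split(df, list(df$ID, df$date), drop = TRUE)
  pieces <- lapply(groups, function(g) {
    in_combo <- g$abx %in% c(anchor, partners)
    # Merge only if the anchor and at least one partner are both present.
    if (anchor %in% g$abx && any(g$abx %in% partners)) {
      merged <- g[which(in_combo)[1], , drop = FALSE]
      merged$abx <- toString(g$abx[in_combo])
      g <- rbind(merged, g[!in_combo, , drop = FALSE])
    }
    g
  })
  out <- do.call(rbind, pieces)
  out <- out[order(out$ID, out$date), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Small hypothetical data set covering the edge cases.
toy <- data.frame(
  ID   = c("P1", "P1", "P1", "P2", "P2", "P3", "P3"),
  abx  = c("ampicillin", "ceftriaxone", "vancomycin",
           "ceftriaxone", "gentamicin",
           "ampicillin", "daptomycin"),
  date = as.Date(c("2020-02-01", "2020-02-01", "2020-02-01",
                   "2020-03-01", "2020-03-01",
                   "2020-01-14", "2020-01-14")),
  stringsAsFactors = FALSE
)

result <- combine_abx(toy)
print(result)
```

In this toy data, P1's ampicillin and ceftriaxone are merged while the same-day vancomycin stays on its own row, P2's ceftriaxone and gentamicin are not merged because ampicillin is absent, and P3 is unchanged. The same function can be called on `df_abx` directly, since it only relies on the `ID`, `abx`, and `date` columns.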