有没有办法根据R中的多个列条件,按ID分组来折叠行?

huangapple go评论60阅读模式
英文:

Is there a way to collapse rows based on multiple column conditions, grouped by ID in R?

问题

我有一个关于抗生素使用数据的数据框(长格式)。对于每位患者,我想要根据以下条件将特定抗生素组合成“联合疗法”:1)特定抗生素列表感兴趣,并且2)仅当这些特定抗生素具有相同的开始日期时。感兴趣的特定疗法包括:

  • Ampicillin 和 ceftriaxone
  • Ampicillin 和 gentamicin
  • Ampicillin 和 streptomycin

我不需要其他组合(例如,如果 ampicillin 和 daptomycin 在同一天开始,我不想将它们组合在一起)。

我有一个患者ID,可以根据它将抗生素使用情况进行分组。我的 start_date 变量已经被转换为日期类(YYYY-MM-DD)。所有其他变量都是字符型。

以下是一组数据示例。在实际的数据框中,我有超过1,000行。

ID <- c("C1", "C1", "C1", "C2", "C3", "C3", "C3", "C4", "C4", "C5", "C5", "C5", "C6", "C6")

abx <- c("ampicillin", "ceftriaxone", "vancomycin", "linezolid", "ampicillin", "gentamicin", "vancomycin", "piptazo", "daptomycin", "ampicillin", "streptomycin", "vancomycin", "ampicillin", "daptomycin") 

date <- c(as.Date("2020-02-01"), as.Date("2020-02-01"), as.Date("2020-01-28"), as.Date("2020-03-01"), as.Date("2020-02-17"), as.Date("2020-02-17"), as.Date("2020-03-01"), as.Date("2020-01-11"), as.Date("2020-01-05"), as.Date("2020-01-05"), as.Date("2020-01-05"), as.Date("2020-01-01"), as.Date("2020-01-14"), as.Date("2020-01-14"))

df_abx <- data.frame(ID, abx, date)

我希望的解决方案如下:

ID abx start_date
C1 ampicillin, ceftriaxone 2020-02-01
C1 vancomycin 2020-01-28
C2 linezolid 2020-03-01
C3 ampicillin, gentamicin 2020-02-17
C3 vancomycin 2020-03-01
C4 piptazo 2020-01-11
C4 daptomycin 2020-01-05
C5 ampicillin, streptomycin 2020-01-05
C5 vancomycin 2020-01-01
C6 ampicillin 2020-01-14
C6 daptomycin 2020-01-14

我怀疑解决方案包括使用 "group_by" 对患者ID进行分组,我已经尝试过使用 "summarize" 和 "case_when",但还没有达到我想要的结果。

英文:

I have a df of episodic antibiotic data (long form). For each patient, I’m looking to combine specific antibiotics into “combo” therapies based on 1) a specific list of antibiotics that would be of interest, and 2) ONLY if those specific antibiotics had the same start dates. The specific therapies of interest are:

  • Ampicillin and ceftriaxone
  • Ampicillin and gentamicin
  • Ampicillin and streptomycin

I don’t need any other combinations (e.g. if ampicillin and daptomycin start on the same day, I don’t want that combined).

I have a patient ID by which the antibiotic episodes can be grouped. My start_date variable has already been coerced into date class (YYYY-MM-DD). All other variables are character.

Here’s a sample set of data. In my actual dataframe, have over 1,000 rows.

ID &lt;- c(&quot;C1&quot;, &quot;C1&quot;, &quot;C1&quot;, &quot;C2&quot;, &quot;C3&quot;, &quot;C3&quot;, &quot;C3&quot;, &quot;C4&quot;, &quot;C4&quot;, &quot;C5&quot;, &quot;C5&quot;, &quot;C5&quot;, &quot;C6&quot;, &quot;C6&quot;)

abx &lt;- c(&quot;ampicillin&quot;, &quot;ceftriaxone&quot;, &quot;vancomycin&quot;, &quot;linezolid&quot;, &quot;ampicillin&quot;, &quot;gentamicin&quot;, &quot;vancomycin&quot;, &quot;piptazo&quot;, &quot;daptomycin&quot;, &quot;ampicillin&quot;, &quot;streptomycin&quot;, &quot;vancomycin&quot;, &quot;ampicillin&quot;, &quot;daptomycin&quot;) 

date &lt;- c(as.Date(&quot;2020-02-01&quot;), as.Date(&quot;2020-02-01&quot;), as.Date(&quot;2020-01-28&quot;), as.Date(&quot;2020-03-01&quot;), as.Date(&quot;2020-02-17&quot;), as.Date(&quot;2020-02-17&quot;), as.Date(&quot;2020-03-01&quot;), as.Date(&quot;2020-01-11&quot;), as.Date(&quot;2020-01-05&quot;), as.Date(&quot;2020-01-05&quot;), as.Date(&quot;2020-01-05&quot;), as.Date(&quot;2020-01-01&quot;), as.Date(&quot;2020-01-14&quot;), as.Date(&quot;2020-01-14&quot;))

df_abx &lt;- data.frame(ID, abx, date)

This is the solution I’m hoping for:

ID abx start_date
C1 ampicillin, ceftriaxone 2020-02-01
C1 vancomycin 2020-01-28
C2 linezolid 2020-03-01
C3 ampicillin, gentamicin 2020-02-17
C3 vancomycin 2020-03-01
C4 piptazo 2020-01-11
C4 daptomycin 2020-01-05
C5 ampicillin, streptomycin 2020-01-05
C5 vancomycin 2020-01-01
C6 ampicillin 2020-01-14
C6 daptomycin 2020-01-14

I suspect the solution includes "group_by" for the patient ID, and I've been toying around with summarize and case_when but I haven't gotten to where I want to be.

答案1

得分: 2

"Update: the correct version ** We can do it this way:

library(dplyr)
library(tidyr)

df_abx %>%
  group_by(ID, date) %>%
  mutate(abx = toString(abx)) %>%
  anti_join(df1, by="abx") %>%
  separate_rows(abx) %>%
  distinct() %>%
  bind_rows(df_abx %>%
              group_by(ID, date) %>%
              mutate(x = ifelse(str_detect(abx, pattern), 1, 0)) %>%
              filter(sum(x) >= 2) %>%
              summarise(abx = toString(abx))) %>%
  arrange(ID, abx)
 ID    abx                      date      
   <chr> <chr>                    <date>    
 1 C1    ampicillin, ceftriaxone  2020-02-01
 2 C1    vancomycin               2020-01-28
 3 C2    linezolid                2020-03-01
 4 C3    ampicillin, gentamicin   2020-02-17
 5 C3    vancomycin               2020-03-01
 6 C4    daptomycin               2020-01-05
 7 C4    piptazo                  2020-01-11
 8 C5    ampicillin, streptomycin 2020-01-05
 9 C5    vancomycin               2020-01-01
10 C6    ampicillin               2020-01-14
11 C6    daptomycin               2020-01-14
```"

<details>
<summary>英文:</summary>

**Update: the correct version ** We can do it this way: 

library(dplyr)
library(tidyr)

df_abx %>%
group_by(ID, date) %>%
mutate(abx = toString(abx)) %>%
anti_join(df1, by="abx") %>%
separate_rows(abx) %>%
distinct() %>%
bind_rows(df_abx %>%
group_by(ID, date) %>%
mutate(x = ifelse(str_detect(abx, pattern), 1, 0)) %>%
filter(sum(x)>=2) %>%
summarise(abx= toString(abx))) %>%
arrange(ID, abx)

ID abx date
<chr> <chr> <date>
1 C1 ampicillin, ceftriaxone 2020-02-01
2 C1 vancomycin 2020-01-28
3 C2 linezolid 2020-03-01
4 C3 ampicillin, gentamicin 2020-02-17
5 C3 vancomycin 2020-03-01
6 C4 daptomycin 2020-01-05
7 C4 piptazo 2020-01-11
8 C5 ampicillin, streptomycin 2020-01-05
9 C5 vancomycin 2020-01-01
10 C6 ampicillin 2020-01-14
11 C6 daptomycin 2020-01-14


</details>



# 答案2
**得分**: 1

我们可以按照 'ID'、'date' 分组,通过使用 `case_match` 创建一个新的分组列,为所选药物创建一个共同的分组,然后添加分组,通过 `paste`(`str_c`)在 `abx` 列中汇总 'abx'。

```r
library(dplyr)
library(stringr)
df_abx %>%
 group_by(ID, date) %>%
  mutate(abx_grp = coalesce(case_match(abx,
   c('ampicillin', 'ceftriaxone', 'gentamicin', 'streptomycin') ~ 'g1'), abx)) %>%
  group_by(abx_grp, .add = TRUE) %>%
 summarise(abx = str_c(abx, collapse = ", "), .groups = 'drop')  %>%
 select(-abx_grp)
英文:

We may group by 'ID', 'date', create a new grouping column with case_match by making a common group for the selected drugs, and then add the grouping, to summarise the 'abx' by pasteing (str_c) the values in abx

library(dplyr)
library(stringr)
df_abx %&gt;% 
 group_by(ID, date) %&gt;%
  mutate(abx_grp = coalesce(case_match(abx,
   c(&#39;ampicillin&#39;, &#39;ceftriaxone&#39;, &#39;gentamicin&#39;, &#39;streptomycin&#39;) ~ &#39;g1&#39;), abx)) %&gt;%
  group_by(abx_grp, .add = TRUE) %&gt;% 
 summarise(abx = str_c(abx, collapse = &quot;, &quot;), .groups = &#39;drop&#39;)  %&gt;% 
 select(-abx_grp)

-output

# A tibble: 11 &#215; 3
   ID    date       abx                     
   &lt;chr&gt; &lt;date&gt;     &lt;chr&gt;                   
 1 C1    2020-01-28 vancomycin              
 2 C1    2020-02-01 ampicillin, ceftriaxone 
 3 C2    2020-03-01 linezolid               
 4 C3    2020-02-17 ampicillin, gentamicin  
 5 C3    2020-03-01 vancomycin              
 6 C4    2020-01-05 daptomycin              
 7 C4    2020-01-11 piptazo                 
 8 C5    2020-01-01 vancomycin              
 9 C5    2020-01-05 ampicillin, streptomycin
10 C6    2020-01-14 daptomycin              
11 C6    2020-01-14 ampicillin        

huangapple
  • 本文由 发表于 2023年2月19日 04:45:18
  • 转载请务必保留本文链接:https://go.coder-hub.com/75496297.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定