如何在基于唯一字符列时,将多个列的事件数据合并为几行?

huangapple go评论141阅读模式
英文:

How to merge several rows with event data from multiple columns when based on unique character column?

问题

假设我有一个名为d的数据框,其中包含以下内容:

  1. > d
  2. pnr age.hl age.hl.time kon.hl kon.hl.time sen.hl sen.hl.time mix.hl mix.hl.time
  3. 1 A 0 2018-12-31 1 2011-02-01 0 2018-12-31 0 2018-12-31
  4. 2 A 1 2013-10-31 0 2018-12-31 0 2018-12-31 1 2013-10-31
  5. 3 A 0 2018-12-31 0 2018-12-31 1 2016-09-30 0 2018-12-31
  6. 4 B 0 2018-12-31 0 2018-12-31 1 2004-04-30 0 2018-12-31
  7. 5 C 0 2018-12-31 0 2018-12-31 0 2018-12-31 1 2006-01-17
  8. 6 C 1 2008-07-31 0 2018-12-31 1 2009-01-31 0 2018-12-31

其中,d$pnr是唯一的患者标识符,age.hlkon.hlsen.hlmix.hl表示不同的疾病及其对应的时间变量。如果在任何一列中出现1,则不能再在同一列中再次出现1。

我需要合并行,使得d$pnr只出现一次。也就是说,每列的事件数据/信息应合并到同一行中。

我在dplyr中寻找解决方案。

期望的输出

  1. pnr age.hl age.hl.time kon.hl kon.hl.time sen.hl sen.hl.time mix.hl mix.hl.time
  2. 1 A 1 2013-10-31 1 2011-02-01 1 2016-09-30 1 2013-10-31
  3. 2 B 0 2018-12-31 0 2018-12-31 1 2004-04-30 0 2018-12-31
  4. 3 C 1 2008-07-31 0 2018-12-31 1 2009-01-31 1 2006-01-17

数据

  1. d <- data.frame(
  2. pnr = c("A", "A", "A", "B", "C", "C"),
  3. age.hl = c(0, 1, 0, 0, 0, 1),
  4. age.hl.time = c(as.Date("2018-12-31"), as.Date("2013-10-31"), as.Date("2018-12-31"),
  5. as.Date("2018-12-31"), as.Date("2018-12-31"), as.Date("2008-07-31")),
  6. kon.hl = c(1, 0, 0, 0, 0, 0),
  7. kon.hl.time = c(as.Date("2011-02-01"), as.Date("2018-12-31"), as.Date("2018-12-31"),
  8. as.Date("2018-12-31"), as.Date("2018-12-31"), as.Date("2018-12-31")),
  9. sen.hl = c(0, 0, 1, 1, 0, 1),
  10. sen.hl.time = c(as.Date("2018-12-31"), as.Date("2018-12-31"), as.Date("2016-09-30"),
  11. as.Date("2004-04-30"), as.Date("2018-12-31"), as.Date("2009-01-31")),
  12. mix.hl = c(0, 1, 0, 0, 1, 0),
  13. mix.hl.time = c(as.Date("2018-12-31"), as.Date("2013-10-31"), as.Date("2018-12-31"),
  14. as.Date("2018-12-31"), as.Date("2006-01-17"), as.Date("2018-12-31"))
  15. )
英文:

Let's say I have d:

  1. &gt; d
  2. pnr age.hl age.hl.time kon.hl kon.hl.time sen.hl sen.hl.time mix.hl mix.hl.time
  3. 1 A 0 2018-12-31 1 2011-02-01 0 2018-12-31 0 2018-12-31
  4. 2 A 1 2013-10-31 0 2018-12-31 0 2018-12-31 1 2013-10-31
  5. 3 A 0 2018-12-31 0 2018-12-31 1 2016-09-30 0 2018-12-31
  6. 4 B 0 2018-12-31 0 2018-12-31 1 2004-04-30 0 2018-12-31
  7. 5 C 0 2018-12-31 0 2018-12-31 0 2018-12-31 1 2006-01-17
  8. 6 C 1 2008-07-31 0 2018-12-31 1 2009-01-31 0 2018-12-31

d$pnr is a unique patient identifier. age.hl, kon.hl, sen.hl and mix.hl denote different diseases and their corresponding time variable. All patients are censored at 2018-12-31 if 0 event occured.

If 1 occur in any column, it cannot occur again in the same column.

I need to merge rows, so d$pnr only occur one time. I.e, event data/information from each column shall be merged to the same row.

I look for a solution in dplyr.

Expected output:

  1. pnr age.hl age.hl.time kon.hl kon.hl.time sen.hl sen.hl.time mix.hl mix.hl.time
  2. 1 A 1 2013-10-31 1 2011-02-01 1 2016-09-30 1 2013-10-31
  3. 2 B 0 2018-12-31 0 2018-12-31 1 2004-04-30 0 2018-12-31
  4. 3 C 1 2008-07-31 0 2018-12-31 1 2009-01-31 1 2006-01-17

Data

  1. d &lt;- data.frame(
  2. pnr = c(&quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;B&quot;, &quot;C&quot;, &quot;C&quot;),
  3. age.hl = c(0, 1, 0, 0, 0, 1),
  4. age.hl.time = c(as.Date(&quot;2018-12-31&quot;), as.Date(&quot;2013-10-31&quot;), as.Date(&quot;2018-12-31&quot;),
  5. as.Date(&quot;2018-12-31&quot;), as.Date(&quot;2018-12-31&quot;), as.Date(&quot;2008-07-31&quot;)),
  6. kon.hl = c(1, 0, 0, 0, 0, 0),
  7. kon.hl.time = c(as.Date(&quot;2011-02-01&quot;), as.Date(&quot;2018-12-31&quot;), as.Date(&quot;2018-12-31&quot;),
  8. as.Date(&quot;2018-12-31&quot;), as.Date(&quot;2018-12-31&quot;), as.Date(&quot;2018-12-31&quot;)),
  9. sen.hl = c(0, 0, 1, 1, 0, 1),
  10. sen.hl.time = c(as.Date(&quot;2018-12-31&quot;), as.Date(&quot;2018-12-31&quot;), as.Date(&quot;2016-09-30&quot;),
  11. as.Date(&quot;2004-04-30&quot;), as.Date(&quot;2018-12-31&quot;), as.Date(&quot;2009-01-31&quot;)),
  12. mix.hl = c(0, 1, 0, 0, 1, 0),
  13. mix.hl.time = c(as.Date(&quot;2018-12-31&quot;), as.Date(&quot;2013-10-31&quot;), as.Date(&quot;2018-12-31&quot;),
  14. as.Date(&quot;2018-12-31&quot;), as.Date(&quot;2006-01-17&quot;), as.Date(&quot;2018-12-31&quot;))
  15. )

答案1

得分: 2

你可以使用which.max()函数:

  1. library(dplyr)
  2. d %>%
  3. summarise(across(ends_with("time"), ~ .x[which.max(get(sub(".time", "", cur_column())))]),
  4. across(ends_with("hl"), max),
  5. .by = pnr) %>%
  6. select(names(d))
  7. # pnr age.hl age.hl.time kon.hl kon.hl.time sen.hl sen.hl.time mix.hl mix.hl.time
  8. # 1 A 1 2013-10-31 1 2011-02-01 1 2016-09-30 1 2013-10-31
  9. # 2 B 0 2018-12-31 0 2018-12-31 1 2004-04-30 0 2018-12-31
  10. # 3 C 1 2008-07-31 0 2018-12-31 1 2009-01-31 1 2006-01-17

注意:第一个across()函数中的值依赖于第二个across()函数中的值,所以这两个across()函数不能互换位置。

英文:

You can use which.max():

  1. library(dplyr)
  2. d %&gt;%
  3. summarise(across(ends_with(&quot;time&quot;), ~ .x[which.max(get(sub(&quot;.time&quot;, &quot;&quot;, cur_column())))]),
  4. across(ends_with(&quot;hl&quot;), max),
  5. .by = pnr) %&gt;%
  6. select(names(d))
  7. # pnr age.hl age.hl.time kon.hl kon.hl.time sen.hl sen.hl.time mix.hl mix.hl.time
  8. # 1 A 1 2013-10-31 1 2011-02-01 1 2016-09-30 1 2013-10-31
  9. # 2 B 0 2018-12-31 0 2018-12-31 1 2004-04-30 0 2018-12-31
  10. # 3 C 1 2008-07-31 0 2018-12-31 1 2009-01-31 1 2006-01-17

Note: The function in the first across() involves the values in the second across(), so the two across() are not exchangeable.

huangapple
  • 本文由 发表于 2023年8月9日 17:51:23
  • 转载请务必保留本文链接:https://go.coder-hub.com/76866546.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定