Cumulative sum based on Subject ID in R
Question
Assume we have a data frame df that looks like this:

subjectid | event | football_year_baseline | football_total |
---|---|---|---|
1 | baseline | 3 | 6 |
1 | followup | NA | ?? |
2 | baseline | 0 | 0 |
2 | followup | NA | ?? |
3 | baseline | 2 | 4 |

I'm trying to fill out the football_total column. For the purposes of this example, assume that the formula in the baseline rows is football_year_baseline * 2.
For the follow-up rows, the result needs to be cumulative: the formula is football_total from baseline + 2. subjectid and event should be used to determine which baseline value to add 2 to.
Please note: not all subjects have a follow-up row.
So football_total in row 2 would be 8 (6 + 2).
Answer 1
Score: 1
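This should work for one or more follow-up rows per subject. It assumes "baseline" is already the first row within each subject; if not, arrange() the data first.

library(dplyr)

# Note: the .by argument requires dplyr 1.1.0 or later
df |>
  mutate(
    # Baseline rows: compute the total directly; all other rows start as NA
    fball_total = case_when(event == "baseline" ~ football_year_baseline * 2, TRUE ~ NA_real_),
    # Follow-up rows: the subject's baseline total plus 2 for each row after the baseline
    fball_total = coalesce(fball_total, fball_total[1] + 2 * (row_number() - 1)),
    .by = subjectid
  )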
#   subjectid    event football_year_baseline football_total fball_total
# 1         1 baseline                      3              6           6
# 2         1 followup                     NA             NA           8
# 3         2 baseline                      0              0           0
# 4         2 followup                     NA             NA           2
# 5         3 baseline                      2              4           4
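The ordering assumption and the "one or more follow-ups" claim can be checked with a small sketch. The data frame df2 below is hypothetical (it is not from the question): subject 1 has two follow-ups and its rows are deliberately out of order, so the baseline row is sorted to the front of each subject before applying the same logic.

```r
library(dplyr)

# Hypothetical data: subject 1 has two follow-ups, and its baseline is not first
df2 <- data.frame(
  subjectid = c(1, 1, 1, 2),
  event = c("followup", "baseline", "followup", "baseline"),
  football_year_baseline = c(NA, 3, NA, 0)
)

res <- df2 |>
  # FALSE sorts before TRUE, so the baseline row comes first within each subject
  arrange(subjectid, event != "baseline") |>
  mutate(
    # Baseline rows get the formula; rows not matched by case_when() become NA
    fball_total = case_when(event == "baseline" ~ football_year_baseline * 2),
    # Each follow-up adds 2 per row after the baseline
    fball_total = coalesce(fball_total, fball_total[1] + 2 * (row_number() - 1)),
    .by = subjectid
  )

res$fball_total
```

Under these assumptions subject 1 should end up with 6, 8 and 10, and subject 2 with 0.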
Answer 2
Score: 1
Since you mention there is at most one follow-up for each subject, here is an alternative dplyr solution using lag():

library(dplyr)

x %>%
  mutate(football_total = case_when(
    event == "baseline" ~ football_total,
    # lag() picks up the previous (baseline) row within the same subject
    event == "followup" ~ lag(football_total) + 2
  ), .by = subjectid)
Output:

  subjectid    event football_year_baseline football_total
1         1 baseline                      3              6
2         1 followup                     NA              8
3         2 baseline                      0              0
4         2 followup                     NA              2
5         3 baseline                      2              4
Data:

x <- read.table(text = "subjectid event football_year_baseline football_total
1 baseline 3 6
1 followup NA NA
2 baseline 0 0
2 followup NA NA
3 baseline 2 4", header = TRUE)
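The lag() approach relies on there being only one follow-up: a second follow-up would look back at a row whose original value is still NA. As a hedged sketch of a variant that tolerates several follow-ups, each missing follow-up value can be treated as an increment of 2 and summed cumulatively within the subject. The data frame x3 is hypothetical, and the sketch assumes the baseline row comes first and its football_total is never NA.

```r
library(dplyr)

# Hypothetical data: subject 1 has two follow-ups, subject 2 has none
x3 <- read.table(text = "subjectid event football_total
1 baseline 6
1 followup NA
1 followup NA
2 baseline 0
3 baseline 4
3 followup NA", header = TRUE)

res <- x3 %>%
  mutate(
    # Replace each NA follow-up with the increment 2, then accumulate per subject
    football_total = cumsum(coalesce(football_total, 2)),
    .by = subjectid
  )

res$football_total
```

With this input, subject 1 should get 6, 8 and 10, subject 2 stays at 0, and subject 3 gets 4 and 6.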
Extended example

To do this across multiple columns, assuming they share the same naming convention (i.e., "xxx_total"), you can use dplyr's across() and contains(). Below I added two columns, vball_baseline and vball_total:

x %>%
  # Apply the same baseline/follow-up rule to every column whose name contains "total"
  mutate(across(contains("total"), ~ case_when(
    event == "baseline" ~ .x,
    event == "followup" ~ lag(.x) + 2
  )), .by = subjectid)
Extended output:

  subjectid    event football_year_baseline football_total vball_baseline vball_total
1         1 baseline                      3              6              1           2
2         1 followup                     NA              8             NA           4
3         2 baseline                      0              0              3           4
4         2 followup                     NA              2             NA           6
5         3 baseline                      2              4              5           6
Extended data:

x <- read.table(text = "subjectid event football_year_baseline football_total vball_baseline vball_total
1 baseline 3 6 1 2
1 followup NA NA NA NA
2 baseline 0 0 3 4
2 followup NA NA NA NA
3 baseline 2 4 5 6", header = TRUE)
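One caveat with contains("total") is that it matches "total" anywhere in a column name. If the data might also hold an unrelated column such as the hypothetical total_visits below, a stricter selector like ends_with("_total") may be safer. This is a minimal sketch on made-up data, not part of the original answer:

```r
library(dplyr)

# Hypothetical data: total_visits contains "total" but should be left untouched
x4 <- read.table(text = "subjectid event football_total vball_total total_visits
1 baseline 6 2 5
1 followup NA NA 7
2 baseline 0 4 3", header = TRUE)

res <- x4 %>%
  # ends_with() only selects columns following the "xxx_total" convention
  mutate(across(ends_with("_total"), ~ case_when(
    event == "baseline" ~ .x,
    event == "followup" ~ lag(.x) + 2
  )), .by = subjectid)

res
```

Here football_total and vball_total are filled in for the follow-up row, while total_visits keeps its original values.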